Pengenalan

Saya Briandamar Kencana seorang data enthusiast. Saya lulus dengan gelar sarjana statistik. Saya memiliki beberapa pengalaman kerja di bidang data, seperti magang di momobil.id, independent research and advisory Indonesia, staf data analis di Alif Aza Asia dan asisten peneliti di Bank Indonesia, dan kemudian saya melanjutkan karir saya sebagai analis di sebuah bank komersial di Indonesia. Selain itu, saya telah berpartisipasi dalam beberapa kursus pelatihan data di Algoritma Data Science School, Coursera, Udemy dan Dicoding.

Informasi Data

Kumpulan data ini diunduh dari situs web “kaggle”, berupa Kumpulan data terkait dengan data tenggelamnya kapal terkenal dalam sejarah yaitu titanic

Permasalahan

Tenggelamnya Titanic adalah salah satu bangkai kapal paling terkenal dalam sejarah.

Pada tanggal 15 April 1912, selama pelayaran perdananya, RMS Titanic yang secara luas dianggap “tidak dapat tenggelam” tenggelam setelah bertabrakan dengan gunung es. Sayangnya, tidak tersedia cukup sekoci untuk semua penumpang, yang mengakibatkan 1502 kematian dari 2224 penumpang dan awak kapal.

Meskipun ada beberapa elemen keberuntungan yang terlibat dalam bertahan hidup, tampaknya beberapa kelompok orang lebih mungkin untuk bertahan hidup daripada yang lain.

Dalam tulisan ini daya membuat model prediksi yang menjawab pertanyaan: “orang seperti apa yang lebih mungkin untuk bertahan hidup?” menggunakan data penumpang (yaitu nama, usia, jenis kelamin, kelas sosial ekonomi, dll).

Metode and Tujuan

Metode yang akan digunakan pada artikel ini adalah decision tree

Data Preparation

library(dplyr) # for data wrangling
library(ggplot2) # to visualize data
library(gridExtra) # to display multiple graph
library(inspectdf) # for EDA
library(caret) # to pre-process data
library(yardstick)
library(rpart)
library(tidyr)

Data Input

Titanic_Survival_train <- read.csv("data_input/train.csv")
Titanic_Survival_test <- read.csv("data_input/test.csv")
rmarkdown::paged_table(Titanic_Survival_train)
rmarkdown::paged_table(Titanic_Survival_test)
Titanic_Survival_test$Survived <- NA
Titanic_Survival_combined <- rbind(Titanic_Survival_train, Titanic_Survival_test)
glimpse(Titanic_Survival_combined)
#> Rows: 1,309
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
#> $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
#> $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
#> $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
#> $ Sex         <chr> "male", "female", "female", "female", "male", "male", "...
#> $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
#> $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
#> $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
#> $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
#> $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
#> $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6",...
#> $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", ...

Data tersebut mengandung beberapa variabel berikut :

  • Passenger ID: Nomor ID penumpang
  • Survived : Selamat atau tidaknya penumpang (0 - selamat, 1 = tidak selamat)
  • Pclass : Kelas tiket (Kelas 1, 2 dan 3)
  • SibSp : Saudara / pasangan yang ikut di Titanic
  • Parch : Orang tua / anak yang ikut di Titanic
  • Ticket : Nomor tiket
  • Fare : Biaya penumpang
  • Cabin : Nomor kabin
  • Embarked : Pelabuhan keberangkatan (Cherbourg, Queenstown dan Southampton)

Data Cleaning

# Convert all blanks to NA
Titanic_Survival_combined[Titanic_Survival_combined==""]  <- NA 
# Cek NA 
miss <- sapply(Titanic_Survival_combined[-2], function(x) sum(is.na(x)))
rmarkdown::paged_table(as.data.frame(miss))

pertama saya mengisi data blank yang ada diage menggunakan decision tree.

# Menggunakan Decision tree untuk menduga Misisng value
NoMissing <- Titanic_Survival_combined %>% filter(!is.na(Age))
# membuat prediksi untuk Age
fit.missing <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked, 
                  data = NoMissing)
summary(fit.missing)
#> Call:
#> rpart(formula = Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked, 
#>     data = NoMissing)
#>   n= 1046 
#> 
#>           CP nsplit rel error    xerror       xstd
#> 1 0.15460486      0 1.0000000 1.0023187 0.04532031
#> 2 0.06667568      1 0.8453951 0.8488128 0.03989784
#> 3 0.02864305      2 0.7787195 0.7834209 0.03643694
#> 4 0.02055366      3 0.7500764 0.7664516 0.03657348
#> 5 0.01911014      4 0.7295228 0.7611593 0.03712603
#> 6 0.01266124      5 0.7104126 0.7385583 0.03596580
#> 7 0.01000000      6 0.6977514 0.7187113 0.03507740
#> 
#> Variable importance
#>   Pclass     Fare    Parch    SibSp Embarked 
#>       35       29       24        7        5 
#> 
#> Node number 1: 1046 observations,    complexity param=0.1546049
#>   mean=29.88114, MSE=207.5502 
#>   left son=2 (762 obs) right son=3 (284 obs)
#>   Primary splits:
#>       Pclass   < 1.5      to the right, improve=0.15460490, (0 missing)
#>       SibSp    < 2.5      to the right, improve=0.07107333, (0 missing)
#>       Fare     < 49.5021  to the left,  improve=0.05839866, (1 missing)
#>       Parch    < 0.5      to the right, improve=0.05804487, (0 missing)
#>       Embarked splits as  RLL,          improve=0.00759466, (2 missing)
#>   Surrogate splits:
#>       Fare     < 47       to the left,  agree=0.904, adj=0.648, (0 split)
#>       Embarked splits as  RLL,          agree=0.771, adj=0.155, (0 split)
#> 
#> Node number 2: 762 observations,    complexity param=0.06667568
#>   mean=26.4229, MSE=162.254 
#>   left son=4 (198 obs) right son=5 (564 obs)
#>   Primary splits:
#>       Parch    < 0.5      to the right, improve=0.11707700, (0 missing)
#>       SibSp    < 1.5      to the right, improve=0.09360196, (0 missing)
#>       Fare     < 27.7354  to the right, improve=0.03743659, (1 missing)
#>       Pclass   < 2.5      to the right, improve=0.03053381, (0 missing)
#>       Embarked splits as  LRR,          improve=0.01442540, (0 missing)
#>   Surrogate splits:
#>       Fare  < 26.125   to the right, agree=0.825, adj=0.328, (0 split)
#>       SibSp < 0.5      to the right, agree=0.790, adj=0.192, (0 split)
#> 
#> Node number 3: 284 observations,    complexity param=0.02055366
#>   mean=39.15993, MSE=210.8999 
#>   left son=6 (31 obs) right son=7 (253 obs)
#>   Primary splits:
#>       Parch    < 1.5      to the right, improve=0.074498790000, (0 missing)
#>       Fare     < 51.93125 to the right, improve=0.041836000000, (0 missing)
#>       Sex      splits as  LR,           improve=0.018811630000, (0 missing)
#>       SibSp    < 1.5      to the right, improve=0.005750945000, (0 missing)
#>       Embarked splits as  LLR,          improve=0.000008227871, (2 missing)
#>   Surrogate splits:
#>       Fare  < 254.9479 to the right, agree=0.908, adj=0.161, (0 split)
#>       SibSp < 2.5      to the right, agree=0.905, adj=0.129, (0 split)
#> 
#> Node number 4: 198 observations,    complexity param=0.02864305
#>   mean=19.06692, MSE=226.2373 
#>   left son=8 (181 obs) right son=9 (17 obs)
#>   Primary splits:
#>       Parch  < 2.5      to the left,  improve=0.13881770, (0 missing)
#>       SibSp  < 1.5      to the right, improve=0.12936180, (0 missing)
#>       Pclass < 2.5      to the right, improve=0.03014899, (0 missing)
#>       Sex    splits as  RL,           improve=0.01292295, (0 missing)
#>       Fare   < 19.37915 to the left,  improve=0.01150715, (0 missing)
#> 
#> Node number 5: 564 observations,    complexity param=0.01266124
#>   mean=29.00532, MSE=114.1267 
#>   left son=10 (374 obs) right son=11 (190 obs)
#>   Primary splits:
#>       Pclass   < 2.5      to the right, improve=0.042703610, (0 missing)
#>       Fare     < 9.54375  to the left,  improve=0.023874460, (1 missing)
#>       Embarked splits as  LRR,          improve=0.012418780, (0 missing)
#>       Sex      splits as  LR,           improve=0.010263810, (0 missing)
#>       SibSp    < 1.5      to the right, improve=0.002706347, (0 missing)
#>   Surrogate splits:
#>       Fare < 10.3354  to the left,  agree=0.904, adj=0.716, (0 split)
#> 
#> Node number 6: 31 observations
#>   mean=27.83613, MSE=322.8546 
#> 
#> Node number 7: 253 observations
#>   mean=40.54743, MSE=179.5452 
#> 
#> Node number 8: 181 observations,    complexity param=0.01911014
#>   mean=17.34945, MSE=205.3523 
#>   left son=16 (50 obs) right son=17 (131 obs)
#>   Primary splits:
#>       SibSp    < 1.5      to the right, improve=0.111619700, (0 missing)
#>       Pclass   < 2.5      to the right, improve=0.060114540, (0 missing)
#>       Fare     < 8.46045  to the right, improve=0.011114890, (0 missing)
#>       Sex      splits as  RL,           improve=0.006643809, (0 missing)
#>       Embarked splits as  LLR,          improve=0.006229641, (0 missing)
#>   Surrogate splits:
#>       Fare     < 27.825   to the right, agree=0.779, adj=0.20, (0 split)
#>       Embarked splits as  RLR,          agree=0.740, adj=0.06, (0 split)
#> 
#> Node number 9: 17 observations
#>   mean=37.35294, MSE=82.81661 
#> 
#> Node number 10: 374 observations
#>   mean=27.43182, MSE=95.7273 
#> 
#> Node number 11: 190 observations
#>   mean=32.10263, MSE=135.8776 
#> 
#> Node number 16: 50 observations
#>   mean=9.6, MSE=61.7875 
#> 
#> Node number 17: 131 observations
#>   mean=20.30725, MSE=228.478
fit.missing
#> n= 1046 
#> 
#> node), split, n, deviance, yval
#>       * denotes terminal node
#> 
#>  1) root 1046 217097.500 29.88114  
#>    2) Pclass>=1.5 762 123637.600 26.42290  
#>      4) Parch>=0.5 198  44794.980 19.06692  
#>        8) Parch< 2.5 181  37168.760 17.34945  
#>         16) SibSp>=1.5 50   3089.375  9.60000 *
#>         17) SibSp< 1.5 131  29930.620 20.30725 *
#>        9) Parch>=2.5 17   1407.882 37.35294 *
#>      5) Parch< 0.5 564  64367.480 29.00532  
#>       10) Pclass>=2.5 374  35802.010 27.43182 *
#>       11) Pclass< 2.5 190  25816.750 32.10263 *
#>    3) Pclass< 1.5 284  59895.570 39.15993  
#>      6) Parch>=1.5 31  10008.490 27.83613 *
#>      7) Parch< 1.5 253  45424.930 40.54743 *
# membuat kolom age predicted
Titanic_Survival_combined <-Titanic_Survival_combined %>% 
  mutate(age.predicted = predict(fit.missing, .)) %>% 
  mutate(Age = ifelse(is.na(Age), age.predicted, Age))
# Remove predicted age kolom
Titanic_Survival_combined$age.predicted <- NULL
print(fit.missing)
#> n= 1046 
#> 
#> node), split, n, deviance, yval
#>       * denotes terminal node
#> 
#>  1) root 1046 217097.500 29.88114  
#>    2) Pclass>=1.5 762 123637.600 26.42290  
#>      4) Parch>=0.5 198  44794.980 19.06692  
#>        8) Parch< 2.5 181  37168.760 17.34945  
#>         16) SibSp>=1.5 50   3089.375  9.60000 *
#>         17) SibSp< 1.5 131  29930.620 20.30725 *
#>        9) Parch>=2.5 17   1407.882 37.35294 *
#>      5) Parch< 0.5 564  64367.480 29.00532  
#>       10) Pclass>=2.5 374  35802.010 27.43182 *
#>       11) Pclass< 2.5 190  25816.750 32.10263 *
#>    3) Pclass< 1.5 284  59895.570 39.15993  
#>      6) Parch>=1.5 31  10008.490 27.83613 *
#>      7) Parch< 1.5 253  45424.930 40.54743 *
# Me replace age dibawah 1 dengan 1
Titanic_Survival_combined$Age[Titanic_Survival_combined$Age < 1] <- 1

selanjutnya mengisi dtaa blank yang ada pada embarked dengan nilai modusnya.

table(Titanic_Survival_combined$Embarked)
#> 
#>   C   Q   S 
#> 270 123 914
Titanic_Survival_combined$Embarked[c(62,830)] = "S"
Titanic_Survival_combined$Embarked <- factor(Titanic_Survival_combined$Embarked)

Selanjutnya mengisi data fare yang blank menggunakan median

# replace missing fare value 
Titanic_Survival_combined$Fare[1044] <- median(Titanic_Survival_combined$Fare, na.rm = TRUE)

Mengekstraksi Title dalam nama

# change to character
Titanic_Survival_combined$Name <- as.character(Titanic_Survival_combined$Name)

# Extract title from passenger name
Titanic_Survival_combined$Title <- gsub("^.*, (.*?)\\..*$", "\\1", Titanic_Survival_combined$Name)

Titanic_Survival_combined$Title[Titanic_Survival_combined$Title == 'Mlle' | Titanic_Survival_combined$Title == 'Ms'] <- 'Miss' 
Titanic_Survival_combined$Title[Titanic_Survival_combined$Title == 'Mme']  <- 'Mrs' 

# Menggabungkan title langka, potensi title tinggi sosial tinggi
Other <- c('Dona', 'Dr', 'Lady', 'the Countess','Capt', 'Col', 'Don', 'Jonkheer', 'Major', 'Rev', 'Sir')

Titanic_Survival_combined$Title[Titanic_Survival_combined$Title %in% Other]  <- 'Other'

Titanic_Survival_combined$Title <- factor(Titanic_Survival_combined$Title)

Titanic_Survival_combined %>% 
  dplyr::select(Sex,Title) %>% 
  dplyr::group_by(Title, Sex) %>% 
  dplyr::summarise(n = n()) %>% 
  ungroup() %>% 
  pivot_wider(id_cols=Sex, names_from=Title, values_from=n,values_fill=list(n=0))
Titanic_Survival_combined$Sex <- as.factor(Titanic_Survival_combined$Sex)
Titanic_Survival_combined$Survived <- as.factor(Titanic_Survival_combined$Survived)
Titanic_Survival_combined$Pclass <- as.ordered(Titanic_Survival_combined$Pclass)

Eksplorasi Data

ggplot(Titanic_Survival_combined[1:891,], aes(x = Survived, fill = Survived)) +
  geom_bar(stat='count') +
  labs(x = 'Berapa Banyak Penumpang yang Meninggal dan Selamat?') +
        geom_label(stat='count',aes(label=..count..), size=7) +
        theme_grey(base_size = 18)

sex_plot <- ggplot(Titanic_Survival_combined[1:891,], aes(x = Sex, fill = Survived)) +
  geom_bar(stat='count', position='dodge') + theme_grey() +
  labs(x = 'Training data only') +
        geom_label(stat='count', aes(label=..count..))

sex_plot

Berdasarkan gambar diatas menunjukan laki-laki cenderung banyak yang tdiak selamat.

ggplot(data=Titanic_Survival_combined[1:891,], aes(x=Age, fill=Survived)) + geom_density(alpha=0.5)

Berdasarkan gambar diatas, saya melihat ada kecenderungan penumpang muda selamat dari kecelakaan itu. Pertimbangan untuk menyelamatkan penumpang muda mungkin muncul selama masa kritis.

Titanic_Survival_combined$AgeGroup <- NA
Titanic_Survival_combined$AgeGroup[Titanic_Survival_combined$Age <= 12] <- "Child"
Titanic_Survival_combined$AgeGroup[Titanic_Survival_combined$Age > 12 & Titanic_Survival_combined$Age <= 18] <- "Adolescent"
Titanic_Survival_combined$AgeGroup[Titanic_Survival_combined$Age > 18 & Titanic_Survival_combined$Age <= 59] <- "Adult"
Titanic_Survival_combined$AgeGroup[Titanic_Survival_combined$Age >= 60] <- "Elderly Adult"
age_group_plot <- ggplot(Titanic_Survival_combined[1:891,], aes(x = AgeGroup, fill = Survived)) +
  geom_bar(stat='count', position='dodge') + labs(x = 'Training data only') +
        theme(legend.position="none") + theme_grey()
age_group_plot

pclass_plot <- ggplot(Titanic_Survival_combined[1:891,], aes(x = Pclass, fill = Survived)) +
  geom_bar(stat='count', position='dodge') + labs(x = 'Training data only') +
        theme(legend.position="none") + theme_grey()
pclass_plot

Penumpang yang meninggal sebagian besar berasal dari kelas 3, kelas paling bawah sedangkan penumpang di kelas satu memberikan porsi terbanyak dari jumlah yang selamat.

ggplot(Titanic_Survival_combined[1:891,], aes(x = Title, fill = Survived)) +
  geom_bar(stat='count', position='stack') +
  labs(x = 'Title')+geom_label(stat='count',aes(label=..count..), size=3)+theme_grey()

Tampaknya title langka tidak terlalu berpengaruh dalam keselematan.

Splitting Data Train-Test

set.seed(100)

train <- Titanic_Survival_combined[1:891,]
prop.table(table(train$Survived))
#> 
#>         0         1 
#> 0.6161616 0.3838384
library(caret)
train_new <- downSample(x = train[, -2], y = train[, 2], yname = "Survived")
prop.table(table(train_new$Survived))
#> 
#>   0   1 
#> 0.5 0.5
set.seed(102)

train_new<-train_new %>% dplyr::select(Survived, Sex, Fare, Embarked, Title, AgeGroup, SibSp, Parch)

index <- sample(nrow(train_new), nrow(train_new)*0.7)

data_train <- train_new[index, ]
data_test <- train_new[-index, ]
prop.table(table(data_train$Survived))
#> 
#>         0         1 
#> 0.4790795 0.5209205

Membangun Model

library(rpart.plot)
library(rattle)
tree.survival <-  rpart(Survived~., data=data_train, method = "class")
fancyRpartPlot(tree.survival)

Evaluasi Model Data Test

tree.survival.predict <- predict(tree.survival, data_test, type="class")
library(caret)
confusionMatrix(as.factor(tree.survival.predict), as.factor(data_test$Survived))
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 92 20
#>          1 21 73
#>                                              
#>                Accuracy : 0.801              
#>                  95% CI : (0.7398, 0.8532)   
#>     No Information Rate : 0.5485             
#>     P-Value [Acc > NIR] : 0.00000000000002826
#>                                              
#>                   Kappa : 0.5985             
#>                                              
#>  Mcnemar's Test P-Value : 1                  
#>                                              
#>             Sensitivity : 0.8142             
#>             Specificity : 0.7849             
#>          Pos Pred Value : 0.8214             
#>          Neg Pred Value : 0.7766             
#>              Prevalence : 0.5485             
#>          Detection Rate : 0.4466             
#>    Detection Prevalence : 0.5437             
#>       Balanced Accuracy : 0.7996             
#>                                              
#>        'Positive' Class : 0                  
#> 

Evaluasi Model Data Training

tree.survival.predict_train <- predict(tree.survival, data_train, type="class")
library(caret)
confusionMatrix(as.factor(tree.survival.predict_train), as.factor(data_train$Survived))
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 195  45
#>          1  34 204
#>                                              
#>                Accuracy : 0.8347             
#>                  95% CI : (0.7983, 0.8669)   
#>     No Information Rate : 0.5209             
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.6695             
#>                                              
#>  Mcnemar's Test P-Value : 0.2606             
#>                                              
#>             Sensitivity : 0.8515             
#>             Specificity : 0.8193             
#>          Pos Pred Value : 0.8125             
#>          Neg Pred Value : 0.8571             
#>              Prevalence : 0.4791             
#>          Detection Rate : 0.4079             
#>    Detection Prevalence : 0.5021             
#>       Balanced Accuracy : 0.8354             
#>                                              
#>        'Positive' Class : 0                  
#> 

Kesimpulan

Berdasarkan model diatas didapatkan hasil akurasi dari data train sebesar 83% dan data test sebesar 80%. Hasil tersebut sudah baik untuk digunakan prediksi.s

 

A work by Briandamar Kencana

damarbrian@gmail.com