Prediction of customer Deposit at the Bank

Explanation

Permulaan

Hallo ini adalah Rpubs ke-enam saya mengenai Supervised Learning Dengan Metode Klasifikasi, semoga bermanfaat :)

Tentang Data

Dataset ini adalah terkait marketing campaign pada Bank Portugis

Beberapa kolom yang terdapat di dalam nya :

age : umur
job : jenis pekerjaan (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
marital : status pernikahan (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
education : pendidikan (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
default: terdapat gagal bayar / tidak (categorical: ‘no’,‘yes’,‘unknown’)
housing: terdapat pinjaman rumah / tidak (categorical: ‘no’,‘yes’,‘unknown’)
loan: terdapat pinjaman / tidak (categorical: ‘no’,‘yes’,‘unknown’)
contact: jenis kontak komunikasi (categorical: ‘cellular’,‘telephone’)
month: kontak terakhir bulan tahun (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
day_of_week: hari kontak terakhir dalam seminggu (categorical: ‘mon’,‘tue’,‘wed’,‘thu’,‘fri’)
duration: durasi kontak terakhir, dalam detik (numeric).
campaign: jumlah kontak yang dilakukan selama kampanye ini dan untuk klien ini (numerik, termasuk kontak terakhir)
pdays:jumlah hari yang berlalu setelah klien terakhir dihubungi dari kampanye sebelumnya (angka; 999 berarti klien tidak pernah dihubungi sebelumnya)
previous: jumlah kontak yang dilakukan sebelum kampanye ini dan untuk klien ini (numerik)
poutcome: hasil dari kampanye pemasaran sebelumnya (kategoris: ‘gagal’, ‘tidak ada’, ‘sukses’)

Business Goal

Disini saya membuat Role-Play yang dimana saya adalah pekerja pada suatu Bank di Indonesia. Kemudian saya dan pihak business ingin melakukan analisa terhadap minat seorang nasabah di dalam melakukan Deposit, sehingga kita dapat membuat suatu strategi campaign yang mampu menaikan minat nasabah untuk melakukan Deposit. Di samping itu juga kita ingin membuat sebuah Model Machine Learning yang dapat memprediksi apakah seorang akan melakukan Deposit atau tidak.

Setup Library

kita disini akan membutuhkan beberapa library baik itu untuk visualisasi maupun pembentukan model

library(tidyverse)
library(ggpubr)
library(e1071)
library(caret) 
library(ROCR)
library(plotly)
library(partykit)

Data Wrangling

Read data terlebih dahulu

bank <- read.csv("data_input/bank.csv", stringsAsFactors = T)

Cek struktur data

str(bank)

#> 'data.frame':    11162 obs. of  17 variables:
#>  $ age      : int  59 56 41 55 54 42 56 60 37 28 ...
#>  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 1 1 10 8 1 5 5 6 10 8 ...
#>  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 3 2 1 2 3 ...
#>  $ education: Factor w/ 4 levels "primary","secondary",..: 2 2 2 2 3 3 3 2 2 2 ...
#>  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ balance  : int  2343 45 1270 2476 184 0 830 545 1 5090 ...
#>  $ housing  : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 2 2 2 ...
#>  $ loan     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 1 1 ...
#>  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ day      : int  5 5 5 5 5 5 6 6 6 6 ...
#>  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
#>  $ duration : int  1042 1467 1389 579 673 562 1201 1030 608 1297 ...
#>  $ campaign : int  1 1 1 1 2 2 1 1 1 3 ...
#>  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#>  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
#>  $ deposit  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...

Cek missing value

anyNA(bank)

#> [1] FALSE

Cek proporsi kelas target

prop.table(table(bank$deposit))

#> 
#>        no       yes 
#> 0.5261602 0.4738398

ggplot(bank, aes(x = deposit)) +
  geom_histogram(
    stat = "count",
    colour = "black",
    aes(fill = deposit),
    show.legend = F
  ) +
  labs(x = "Deposit", y = "Freq", title = "Proporsi kelas Target") +
  theme_minimal()

Dapat dilihat proporsi kelas cukup balance yang dimana terlihat selish kelas Positif (yes) dan Negatif (no) terpaut cukup kecil / tidak signifikan extreme berbeda. Jadi saya rasa kita tidak perlu melakukan downSampling / upSampling

Exploratory Data Analysis

Kita akan coba melakukan analisa terkait keberhasilan dan ketidak-berhasilan suatu marketing campaign pada Data Bank Kita yang dimana akan membuat nasabah melakukan Deposit atau tidak.

Analisa keberhasilan Nasabah melakukan Deposit atau tidak berdasarkan variable Prediktor Age (Umur)

freq_age <- bank %>%
  select(age, deposit) %>%
  table %>%
  as.data.frame

freq_age_yes <- freq_age %>%
  filter(deposit == "yes") %>%
  arrange(deposit, desc(Freq))

freq_age_no <- freq_age %>%
  filter(deposit == "no") %>%
  arrange(deposit, desc(Freq))

#> List of 93
#>  $ line                      :List of 6
#>   ..$ colour       : chr "black"
#>   ..$ size         : num 0.5
#>   ..$ linetype     : num 1
#>   ..$ lineend      : chr "butt"
#>   ..$ arrow        : logi FALSE
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_line" "element"
#>  $ rect                      :List of 5
#>   ..$ fill         : chr "white"
#>   ..$ colour       : chr "black"
#>   ..$ size         : num 0.5
#>   ..$ linetype     : num 1
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_rect" "element"
#>  $ text                      :List of 11
#>   ..$ family       : chr ""
#>   ..$ face         : chr "plain"
#>   ..$ colour       : chr "black"
#>   ..$ size         : num 11
#>   ..$ hjust        : num 0.5
#>   ..$ vjust        : num 0.5
#>   ..$ angle        : num 0
#>   ..$ lineheight   : num 0.9
#>   ..$ margin       : 'margin' num [1:4] 0points 0points 0points 0points
#>   .. ..- attr(*, "unit")= int 8
#>   ..$ debug        : logi FALSE
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ title                     : NULL
#>  $ aspect.ratio              : NULL
#>  $ axis.title                : NULL
#>  $ axis.title.x              :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : NULL
#>   ..$ hjust        : NULL
#>   ..$ vjust        : num 1
#>   ..$ angle        : NULL
#>   ..$ lineheight   : NULL
#>   ..$ margin       : 'margin' num [1:4] 2.75points 0points 0points 0points
#>   .. ..- attr(*, "unit")= int 8
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ axis.title.x.top          :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : NULL
#>   ..$ hjust        : NULL
#>   ..$ vjust        : num 0
#>   ..$ angle        : NULL
#>   ..$ lineheight   : NULL
#>   ..$ margin       : 'margin' num [1:4] 0points 0points 2.75points 0points
#>   .. ..- attr(*, "unit")= int 8
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ axis.title.x.bottom       : NULL
#>  $ axis.title.y              :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : NULL
#>   ..$ hjust        : NULL
#>   ..$ vjust        : num 1
#>   ..$ angle        : num 90
#>   ..$ lineheight   : NULL
#>   ..$ margin       : 'margin' num [1:4] 0points 2.75points 0points 0points
#>   .. ..- attr(*, "unit")= int 8
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ axis.title.y.left         : NULL
#>  $ axis.title.y.right        :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : NULL
#>   ..$ hjust        : NULL
#>   ..$ vjust        : num 0
#>   ..$ angle        : num -90
#>   ..$ lineheight   : NULL
#>   ..$ margin       : 'margin' num [1:4] 0points 0points 0points 2.75points
#>   .. ..- attr(*, "unit")= int 8
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ axis.text                 :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : chr "grey30"
#>   ..$ size         : 'rel' num 0.8
#>   ..$ hjust        : NULL
#>   ..$ vjust        : NULL
#>   ..$ angle        : NULL
#>   ..$ lineheight   : NULL
#>   ..$ margin       : NULL
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ axis.text.x               :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : NULL
#>   ..$ hjust        : NULL
#>   ..$ vjust        : num 1
#>   ..$ angle        : NULL
#>   ..$ lineheight   : NULL
#>   ..$ margin       : 'margin' num [1:4] 2.2points 0points 0points 0points
#>   .. ..- attr(*, "unit")= int 8
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ axis.text.x.top           :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : NULL
#>   ..$ hjust        : NULL
#>   ..$ vjust        : num 0
#>   ..$ angle        : NULL
#>   ..$ lineheight   : NULL
#>   ..$ margin       : 'margin' num [1:4] 0points 0points 2.2points 0points
#>   .. ..- attr(*, "unit")= int 8
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ axis.text.x.bottom        : NULL
#>  $ axis.text.y               :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : NULL
#>   ..$ hjust        : num 1
#>   ..$ vjust        : NULL
#>   ..$ angle        : NULL
#>   ..$ lineheight   : NULL
#>   ..$ margin       : 'margin' num [1:4] 0points 2.2points 0points 0points
#>   .. ..- attr(*, "unit")= int 8
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ axis.text.y.left          : NULL
#>  $ axis.text.y.right         :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : NULL
#>   ..$ hjust        : num 0
#>   ..$ vjust        : NULL
#>   ..$ angle        : NULL
#>   ..$ lineheight   : NULL
#>   ..$ margin       : 'margin' num [1:4] 0points 0points 0points 2.2points
#>   .. ..- attr(*, "unit")= int 8
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ axis.ticks                : list()
#>   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
#>  $ axis.ticks.x              : NULL
#>  $ axis.ticks.x.top          : NULL
#>  $ axis.ticks.x.bottom       : NULL
#>  $ axis.ticks.y              : NULL
#>  $ axis.ticks.y.left         : NULL
#>  $ axis.ticks.y.right        : NULL
#>  $ axis.ticks.length         : 'simpleUnit' num 2.75points
#>   ..- attr(*, "unit")= int 8
#>  $ axis.ticks.length.x       : NULL
#>  $ axis.ticks.length.x.top   : NULL
#>  $ axis.ticks.length.x.bottom: NULL
#>  $ axis.ticks.length.y       : NULL
#>  $ axis.ticks.length.y.left  : NULL
#>  $ axis.ticks.length.y.right : NULL
#>  $ axis.line                 : list()
#>   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
#>  $ axis.line.x               : NULL
#>  $ axis.line.x.top           : NULL
#>  $ axis.line.x.bottom        : NULL
#>  $ axis.line.y               : NULL
#>  $ axis.line.y.left          : NULL
#>  $ axis.line.y.right         : NULL
#>  $ legend.background         : list()
#>   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
#>  $ legend.margin             : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
#>   ..- attr(*, "unit")= int 8
#>  $ legend.spacing            : 'simpleUnit' num 11points
#>   ..- attr(*, "unit")= int 8
#>  $ legend.spacing.x          : NULL
#>  $ legend.spacing.y          : NULL
#>  $ legend.key                : list()
#>   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
#>  $ legend.key.size           : 'simpleUnit' num 1.2lines
#>   ..- attr(*, "unit")= int 3
#>  $ legend.key.height         : NULL
#>  $ legend.key.width          : NULL
#>  $ legend.text               :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : 'rel' num 0.8
#>   ..$ hjust        : NULL
#>   ..$ vjust        : NULL
#>   ..$ angle        : NULL
#>   ..$ lineheight   : NULL
#>   ..$ margin       : NULL
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ legend.text.align         : NULL
#>  $ legend.title              :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : NULL
#>   ..$ hjust        : num 0
#>   ..$ vjust        : NULL
#>   ..$ angle        : NULL
#>   ..$ lineheight   : NULL
#>   ..$ margin       : NULL
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ legend.title.align        : NULL
#>  $ legend.position           : chr "right"
#>  $ legend.direction          : NULL
#>  $ legend.justification      : chr "center"
#>  $ legend.box                : NULL
#>  $ legend.box.just           : NULL
#>  $ legend.box.margin         : 'margin' num [1:4] 0cm 0cm 0cm 0cm
#>   ..- attr(*, "unit")= int 1
#>  $ legend.box.background     : list()
#>   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
#>  $ legend.box.spacing        : 'simpleUnit' num 11points
#>   ..- attr(*, "unit")= int 8
#>  $ panel.background          : list()
#>   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
#>  $ panel.border              : list()
#>   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
#>  $ panel.spacing             : 'simpleUnit' num 5.5points
#>   ..- attr(*, "unit")= int 8
#>  $ panel.spacing.x           : NULL
#>  $ panel.spacing.y           : NULL
#>  $ panel.grid                :List of 6
#>   ..$ colour       : chr "grey92"
#>   ..$ size         : NULL
#>   ..$ linetype     : NULL
#>   ..$ lineend      : NULL
#>   ..$ arrow        : logi FALSE
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_line" "element"
#>  $ panel.grid.major          : NULL
#>  $ panel.grid.minor          :List of 6
#>   ..$ colour       : NULL
#>   ..$ size         : 'rel' num 0.5
#>   ..$ linetype     : NULL
#>   ..$ lineend      : NULL
#>   ..$ arrow        : logi FALSE
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_line" "element"
#>  $ panel.grid.major.x        : NULL
#>  $ panel.grid.major.y        : NULL
#>  $ panel.grid.minor.x        : NULL
#>  $ panel.grid.minor.y        : NULL
#>  $ panel.ontop               : logi FALSE
#>  $ plot.background           : list()
#>   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
#>  $ plot.title                :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : 'rel' num 1.2
#>   ..$ hjust        : num 0
#>   ..$ vjust        : num 1
#>   ..$ angle        : NULL
#>   ..$ lineheight   : NULL
#>   ..$ margin       : 'margin' num [1:4] 0points 0points 5.5points 0points
#>   .. ..- attr(*, "unit")= int 8
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ plot.title.position       : chr "panel"
#>  $ plot.subtitle             :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : NULL
#>   ..$ hjust        : num 0
#>   ..$ vjust        : num 1
#>   ..$ angle        : NULL
#>   ..$ lineheight   : NULL
#>   ..$ margin       : 'margin' num [1:4] 0points 0points 5.5points 0points
#>   .. ..- attr(*, "unit")= int 8
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ plot.caption              :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : 'rel' num 0.8
#>   ..$ hjust        : num 1
#>   ..$ vjust        : num 1
#>   ..$ angle        : NULL
#>   ..$ lineheight   : NULL
#>   ..$ margin       : 'margin' num [1:4] 5.5points 0points 0points 0points
#>   .. ..- attr(*, "unit")= int 8
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ plot.caption.position     : chr "panel"
#>  $ plot.tag                  :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : 'rel' num 1.2
#>   ..$ hjust        : num 0.5
#>   ..$ vjust        : num 0.5
#>   ..$ angle        : NULL
#>   ..$ lineheight   : NULL
#>   ..$ margin       : NULL
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ plot.tag.position         : chr "topleft"
#>  $ plot.margin               : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
#>   ..- attr(*, "unit")= int 8
#>  $ strip.background          : list()
#>   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
#>  $ strip.background.x        : NULL
#>  $ strip.background.y        : NULL
#>  $ strip.placement           : chr "inside"
#>  $ strip.text                :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : chr "grey10"
#>   ..$ size         : 'rel' num 0.8
#>   ..$ hjust        : NULL
#>   ..$ vjust        : NULL
#>   ..$ angle        : NULL
#>   ..$ lineheight   : NULL
#>   ..$ margin       : 'margin' num [1:4] 4.4points 4.4points 4.4points 4.4points
#>   .. ..- attr(*, "unit")= int 8
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ strip.text.x              : NULL
#>  $ strip.text.y              :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : NULL
#>   ..$ hjust        : NULL
#>   ..$ vjust        : NULL
#>   ..$ angle        : num -90
#>   ..$ lineheight   : NULL
#>   ..$ margin       : NULL
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  $ strip.switch.pad.grid     : 'simpleUnit' num 2.75points
#>   ..- attr(*, "unit")= int 8
#>  $ strip.switch.pad.wrap     : 'simpleUnit' num 2.75points
#>   ..- attr(*, "unit")= int 8
#>  $ strip.text.y.left         :List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : NULL
#>   ..$ hjust        : NULL
#>   ..$ vjust        : NULL
#>   ..$ angle        : num 90
#>   ..$ lineheight   : NULL
#>   ..$ margin       : NULL
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi TRUE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  - attr(*, "class")= chr [1:2] "theme" "gg"
#>  - attr(*, "complete")= logi TRUE
#>  - attr(*, "validate")= logi TRUE

# gabungkan ke-empat plot
ggarrange(p_one, p_two, p_three, p_four,
          ncol = 2, nrow = 2,
          labels = c("A", "B", "C", "D"))

Dari plot diatas kita dapat simpulkan beberapa hal: - Dapat di lihat dari plot A dan B, yang dimana banyak nasabah yang melakukan Deposit berada kebanyakan rentang umur 28 - 37 dan bagi nasabah yang telah berumur (tua), cendrung sedikit di dalam melakukan Deposit - Kemudian dari plot C dan D. Kebanyakan nasabah yang tidak melakukan Deposit berada pada rentang umur 30 - 40 dan kembali pada nasabah yang telah berumur, lebih memilih untuk tidak melakukan Deposit. Dari sini dapat kita simpulkan rata-rata nasabah yang melakukan Deposit berada pada rentang umur yang mudah, dan nasabah yang tidak melakukan Deposit berada pada umur yang tergolong tua. Ini masuk akal jika mengingat banyak Kawula Muda yang tentu ingin mempersiapkan masa tua nya dengan melakukan Deposito sebagai salah satu bentuk investasi. Dan bagi nasabah yang telah berumur cendrung untuk tidak melakukan Deposito mengingat fungsi Deposito sebagai tabungan jangka panjang yang penantian nya tidak sesuai lagi dengan umur mereka

Kemudian mari kita lakukan analisa berdasarkan Variable predikto Marital (Pernikahan)

freq_marital <- bank %>%
  select(marital, deposit) %>%
  table %>% 
  as.data.frame

ggplot(freq_marital, aes(x = marital, y = Freq, fill = deposit)) + geom_col(aes(fill = deposit), position ="dodge") +
  labs(title = "Frekuensi Deposito Berdasarkan Status Pernikahan", y = "Freq", x = "Marital") +
  theme_minimal() +
  labs(x = NULL, fill = "Deposit") +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  )

Terlihat dari grafik diatas. Status Pernikahan Single lebih cendrung memilih untuk melakuakn Deposito di bandingkan yang telah menikah apalagi yang bercerai. Ini masuk akal jika kita melihat kehidupan seseorang yang Single lebih memiliki pengeluaran yang tergolong sedikit, sehingga lebih memungkinkan untuk melakukan Deposito di bandingkan yang telah menikah dan bercerai

Mari lanjutkan analisa terhadap variable prediktor Default (Gagal bayar pinjaman)

freq_default <- bank %>%
  select(default , deposit) %>%
  table %>% 
  as.data.frame

ggplot(freq_default, aes(x = default, y = Freq, fill = deposit)) + geom_col(aes(fill = deposit), position ="dodge") +
  labs(title = "Freq Deposit by Default", y = "Freq", x = "Deposit") +
  theme_minimal() +
  labs(fill = "Deposit") +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  )

Terlihat dari plot di atas. Nasabah yang sudah pernah default (Gagal membayar pinjaman) lebih cendrung sedikit melakukan Deposito. Tetapi nasabah yang tidak pernah Default, lebih memiliki kemungkinan untuk melakukan Deposito walaupun lebih banyak yang tidak melakukan Deposito.

Terakhir kita akan melakukan analisa terhadap variable prediktor Loan (Pinjaman)

freq_loan <- bank %>%
  select(loan, deposit) %>%
  table %>% 
  as.data.frame

ggplot(freq_loan, aes(x = loan, y = Freq, fill = deposit)) + geom_col(aes(fill = deposit), position ="dodge") +
  labs(title = "Freq Deposit by Loan", y = "Freq", x = "Loan") +
  theme_minimal() +
  labs(fill = "Deposit") +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    
  )

Sesuai dugaan. Nasabah yang memiliki pinjaman secara logika nya sudah pasti tidak akan melakukan Deposito, terlihat jelas pada plot diatas. Namun untuk Nasabah yang tidak memiliki pinjaman, cukup banyak yang melakukan Deposito di samping yang tidak melakukan Deposito.

Mari kita cek korelasi antar variable Numerik

# Matrix Correlation
library(GGally)
bank2 <- read.csv("data_input/bank.csv")
ggcorr(
  bank2,
  hjust = 1,
  layout.exp = 2,
  label = T,
  label_size = 2.9
)

Terlihat diatas tidak terdapat variable yang memiliki korelasi kuat antar prediktor nya. Tetapi pada kasus kita kali ini, hal itu tidak penting karena Model yang akan kita gunakan semuanya bebas asumsi.

Cross Validation

# Cross Validation
set.seed(100)

index <- sample(x = nrow(bank), size= nrow(bank)*0.8)
bank_train <- bank[index,] # subsetting data berdasarkan index data yang ada di variabel index 
bank_test <- bank[-index,] 
prop.table(table(bank_train$deposit))

#> 
#>        no       yes 
#> 0.5244708 0.4755292

Terlihat Proporsi target variable nya cukup seimbang, tidak berbeda secara signifikan extreme

Naive Bayes

Naive Bayes adalah suatu algoritma klasifikasi yang didasari oleh Bayes’ Theorem of Probability. algoritma ini sendiri sering di jadikan sebagai Base Model / Benchmark Model guna membandingkan dengan algoritma kompleks lainnya karena model ini cukup Robust yang menganggap semua prediktor sama penting (semua prediktor independen) dan semua prediktor berpengaruh juga terhadap variable target serta komputasi model yang lebih ringan.

Mari kita latih Model Naive Bayes kita menggunakan Data Train yang sudah kita buat tadi

model_bank <- naiveBayes(deposit ~ ., data = bank_train, laplace = 1)

# model interpretasi
model_bank

#> 
#> Naive Bayes Classifier for Discrete Predictors
#> 
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#> 
#> A-priori probabilities:
#> Y
#>        no       yes 
#> 0.5244708 0.4755292 
#> 
#> Conditional probabilities:
#>      age
#> Y         [,1]     [,2]
#>   no  40.92227 10.24666
#>   yes 41.69336 13.52419
#> 
#>      job
#> Y          admin. blue-collar entrepreneur   housemaid  management     retired
#>   no  0.122896699 0.207028754  0.033439830 0.029392971 0.215761448 0.043024494
#>   yes 0.119069986 0.134335369  0.023954908 0.019492720 0.244246125 0.097698450
#>      job
#> Y     self-employed    services     student  technician  unemployed     unknown
#>   no    0.039829606 0.089456869 0.016187433 0.169542066 0.027263046 0.006176784
#>   yes   0.035697511 0.071395021 0.048849225 0.159934241 0.038046031 0.007280413
#> 
#>      marital
#> Y      divorced   married    single
#>   no  0.1131029 0.6148101 0.2720871
#>   yes 0.1200282 0.5212991 0.3586726
#> 
#>      education
#> Y        primary  secondary   tertiary    unknown
#>   no  0.15510988 0.51845530 0.28376360 0.04267122
#>   yes 0.11552941 0.46305882 0.37552941 0.04588235
#> 
#>      default
#> Y             no        yes
#>   no  0.97972252 0.02027748
#>   yes 0.98917137 0.01082863
#> 
#>      balance
#> Y         [,1]     [,2]
#>   no  1271.344 2830.382
#>   yes 1847.275 3708.306
#> 
#>      housing
#> Y            no       yes
#>   no  0.4264674 0.5735326
#>   yes 0.6358286 0.3641714
#> 
#>      loan
#> Y             no        yes
#>   no  0.83415155 0.16584845
#>   yes 0.90536723 0.09463277
#> 
#>      contact
#> Y       cellular  telephone    unknown
#>   no  0.62654716 0.06572770 0.30772514
#>   yes 0.82019299 0.07460579 0.10520122
#> 
#>      day
#> Y         [,1]     [,2]
#>   no  16.11232 8.361403
#>   yes 15.26354 8.525471
#> 
#>      month
#> Y             apr         aug         dec         feb         jan         jul
#>   no  0.060063898 0.141640043 0.001916933 0.057933972 0.034291800 0.150798722
#>   yes 0.113668389 0.126820103 0.019023016 0.079379991 0.025364021 0.118835134
#>      month
#> Y             jun         mar         may         nov         oct         sep
#>   no  0.114164004 0.004685836 0.320340788 0.094142705 0.011075612 0.008945687
#>   yes 0.104039455 0.046500705 0.176843589 0.077736026 0.062470643 0.049318929
#> 
#>      duration
#> Y        [,1]     [,2]
#>   no  224.372 211.0635
#>   yes 541.334 395.6650
#> 
#>      campaign
#> Y         [,1]     [,2]
#>   no  2.826607 3.164951
#>   yes 2.157560 1.954102
#> 
#>      pdays
#> Y         [,1]      [,2]
#>   no  35.63741  95.94072
#>   yes 68.21267 118.42614
#> 
#>      previous
#> Y         [,1]     [,2]
#>   no  0.512492 1.873967
#>   yes 1.166981 2.594395
#> 
#>      poutcome
#> Y        failure      other    success    unknown
#>   no  0.10646469 0.03819074 0.01642842 0.83891615
#>   yes 0.11576471 0.05882353 0.18282353 0.64258824

Dari hasil model Naive Bayes kita, dapat kita interpretasikan beberapa hal di antaranya: - Nasabah yang tidak memiliki pinjaman berpeluang 90% untuk melakukan Deposito - Nasabah yang tidak pernah gagal bayar pinjaman berpeluang 98% untuk melakukan Deposito - Dari sekian banyak pekerjaan (job) dari para nasabah. Yang paling besar peluang nya untuk melakukan Deposito dari pada tidak melakukan nya ialah Admin (11%), Manajemen (24%) dan Teknisi (15%)

Mari kita lakukan Prediksi dengan Model yang telah kita buat menggunakan Data Test yang telah kita buat sebelumnya

# model prediksi
bank_predClass <- predict(object = model_bank, newdata = bank_test, type = "class")

head(data.frame(Prediksi = bank_predClass, actual = bank_test$deposit))

#>   Prediksi actual
#> 1      yes    yes
#> 2       no    yes
#> 3       no    yes
#> 4      yes    yes
#> 5       no    yes
#> 6      yes    yes

Untuk menguji Performansi dari pada model kita, mari gunakan Confusion Matrix

conf_naive <- confusionMatrix(data = bank_predClass, reference = bank_test$deposit, positive = "yes") 
conf_naive

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  962 276
#>        yes 228 767
#>                                              
#>                Accuracy : 0.7743             
#>                  95% CI : (0.7564, 0.7915)   
#>     No Information Rate : 0.5329             
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.5453             
#>                                              
#>  Mcnemar's Test P-Value : 0.0363             
#>                                              
#>             Sensitivity : 0.7354             
#>             Specificity : 0.8084             
#>          Pos Pred Value : 0.7709             
#>          Neg Pred Value : 0.7771             
#>              Prevalence : 0.4671             
#>          Detection Rate : 0.3435             
#>    Detection Prevalence : 0.4456             
#>       Balanced Accuracy : 0.7719             
#>                                              
#>        'Positive' Class : yes                
#>

Dari hasil Confusion Matrix diatas dapat kita simpulkan bahwa: - Recall / Sensitivity => Yang benar di prediksi positif dari yang realita nya (aktual nya) positif oleh Model kita ada 73% - Specifity => Yang benar di prediksi negatif dari yang realita nya (aktual nya) negatif oleh Model kita ada 80% - Precission Yang benar di prediksi positif dari yang di prediksi nya - positif oleh Model kita ada 77% - Accuracy => Seberapa banyak yang benar di prediksi dari keseluruhan data (positif maupun negatif) oleh model kita ada 77%

Kemudian kita akan melihat kualitas model kita melalui Plot ROC dan nilai AUC nya

# roc dan auc

# buat prediksi dalam bentuk probabilitas
bank_prob <- predict(object = model_bank, newdata = bank_test, type = "raw")

data_roc <- data.frame(pred_proba = bank_prob[, "yes"],
                       actual = ifelse(bank_test$deposit == "yes" ,1,0))

# buat objek prediction
roc_pred <- prediction(predictions = data_roc$pred_proba, labels = data_roc$actual)

# buat plot dari objek prediction
plot(performance(prediction.obj = roc_pred, measure = "tpr", x.measure = "fpr"))

auc_value <- performance(prediction.obj = roc_pred, measure = "auc")
auc_value@y.values # untuk menampilkan nilai AUC dari Plot ROC

#> [[1]]
#> [1] 0.8501478

Di lihat dari plot ROC nya, untuk False Positive Rate nya sudah mendakati 0, yan dimana itu artinya Kesalahn model kita di dalam memprediksi sudah semakin sedikit / kecil. Kemudian terkait True Positive Rate nya sudah mendekati 1, yang artinya model kita sudah semakin baik didalam memprediksi kelas Positif maupun Negatif. Di lihat dari nilai AUC nya pun sudah mendakati 1 (85%), dengan kata lain performansi model kita sudah semakin baik di dalam mengklasifikasikan kelas Positif maupun Negatif.

Dari hasil model kita diatas, sebenarnya kita masih bisa melakukan Tunning Model menggunakan perubahan Threshold. Mari kita coba memvisualisasikan hasil pergeseran Threshold yang akan meningkatkan Value ke-empat metrix kita (Accuracy, Precision, Recall dan Specifity)

predict_prob <- data.frame(negative = round(bank_prob[,1],4), positive = round(bank_prob[,2],4))

# model tuning - metrics function
metrics <- function(cutoff, prob, ref, postarget, negtarget) 
{
  predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
  conf <- caret::confusionMatrix(predict , ref, positive = postarget)
  acc <- conf$overall[1]
  rec <- conf$byClass[1]
  prec <- conf$byClass[3]
  spec <- conf$byClass[2]
  mat <- t(as.matrix(c(rec , acc , prec, spec))) 
  colnames(mat) <- c("recall", "accuracy", "precicion", "specificity")
  return(mat)
}
co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)

# apply function metrics
for(i in 1:100){
  result[i,] = metrics(cutoff = co[i], 
                       prob = predict_prob$positive, 
                       ref = as.factor(ifelse(bank_test$deposit == "yes", 1, 0)), 
                       postarget = "1", 
                       negtarget = "0")
}

# visualize
ggplotly(tibble("Recall" = result[,1],
                "Accuracy" = result[,2],
                "Precision" = result[,3],
                "Specificity" = result[,4],
                "Cutoff" = co) %>% 
           gather(key = "Metrics", value = "value", 1:4) %>% 
           ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
           geom_line(lwd = 1.5) +
           scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
           scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
           scale_x_continuous(breaks = seq(0,1,0.1)) +
           labs(title = "Tradeoff Model Perfomance") +
           theme_minimal() +
           theme(legend.position = "top",
                 panel.grid.minor.y = element_blank(),
                 panel.grid.minor.x = element_blank()))

Sepertinya Threshold default (0.5) sudah membuat model kita memiliki nilai yang cukup baik untuk metrix (Accuracy, Precision, Recall dan Specifity)

Decision Tree

Decision Tree merupakan Model yang cukup banyak di gunakan juga karena dapat di interpretasikan, bisa digunakan untuk prediktor numerik maupun kategorik ,bisa mengatasi outlier dan cukup robust

Mari langsung kita coba membuat Model Decision Tree nya

model_bank_tree <- ctree(formula = deposit ~ ., data = bank_train)

# struktru data pohon
model_bank_tree

#> 
#> Model formula:
#> deposit ~ age + job + marital + education + default + balance + 
#>     housing + loan + contact + day + month + duration + campaign + 
#>     pdays + previous + poutcome
#> 
#> Fitted party:
#> [1] root
#> |   [2] duration <= 206
#> |   |   [3] month in apr, dec, feb, mar, oct, sep
#> |   |   |   [4] duration <= 77
#> |   |   |   |   [5] day <= 20: no (n = 112, err = 7.1%)
#> |   |   |   |   [6] day > 20: no (n = 12, err = 50.0%)
#> |   |   |   [7] duration > 77
#> |   |   |   |   [8] poutcome in failure, other
#> |   |   |   |   |   [9] duration <= 124: no (n = 67, err = 19.4%)
#> |   |   |   |   |   [10] duration > 124
#> |   |   |   |   |   |   [11] housing in no: yes (n = 58, err = 32.8%)
#> |   |   |   |   |   |   [12] housing in yes: no (n = 49, err = 36.7%)
#> |   |   |   |   [13] poutcome in success, unknown
#> |   |   |   |   |   [14] month in apr, feb
#> |   |   |   |   |   |   [15] housing in no
#> |   |   |   |   |   |   |   [16] day <= 7: no (n = 63, err = 33.3%)
#> |   |   |   |   |   |   |   [17] day > 7: yes (n = 148, err = 15.5%)
#> |   |   |   |   |   |   [18] housing in yes
#> |   |   |   |   |   |   |   [19] poutcome in success: yes (n = 12, err = 8.3%)
#> |   |   |   |   |   |   |   [20] poutcome in unknown: no (n = 71, err = 26.8%)
#> |   |   |   |   |   [21] month in dec, mar, oct, sep
#> |   |   |   |   |   |   [22] campaign <= 4: yes (n = 202, err = 12.4%)
#> |   |   |   |   |   |   [23] campaign > 4: no (n = 9, err = 33.3%)
#> |   |   [24] month in aug, jan, jul, jun, may, nov
#> |   |   |   [25] poutcome in failure, other, unknown
#> |   |   |   |   [26] duration <= 127
#> |   |   |   |   |   [27] job in admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, technician, unknown
#> |   |   |   |   |   |   [28] poutcome in failure, other
#> |   |   |   |   |   |   |   [29] month in aug, jan, jul, may, nov
#> |   |   |   |   |   |   |   |   [30] contact in cellular
#> |   |   |   |   |   |   |   |   |   [31] job in admin., blue-collar, entrepreneur, management, retired, services, technician: no (n = 159, err = 2.5%)
#> |   |   |   |   |   |   |   |   |   [32] job in housemaid, self-employed, unknown: no (n = 10, err = 30.0%)
#> |   |   |   |   |   |   |   |   [33] contact in telephone, unknown: no (n = 14, err = 14.3%)
#> |   |   |   |   |   |   |   [34] month in jun: yes (n = 8, err = 25.0%)
#> |   |   |   |   |   |   [35] poutcome in unknown
#> |   |   |   |   |   |   |   [36] marital in divorced, married: no (n = 1031, err = 1.0%)
#> |   |   |   |   |   |   |   [37] marital in single: no (n = 307, err = 3.9%)
#> |   |   |   |   |   [38] job in student, unemployed: no (n = 56, err = 19.6%)
#> |   |   |   |   [39] duration > 127
#> |   |   |   |   |   [40] housing in no
#> |   |   |   |   |   |   [41] poutcome in failure, other: yes (n = 67, err = 35.8%)
#> |   |   |   |   |   |   [42] poutcome in unknown
#> |   |   |   |   |   |   |   [43] day <= 5
#> |   |   |   |   |   |   |   |   [44] contact in cellular, telephone: yes (n = 44, err = 36.4%)
#> |   |   |   |   |   |   |   |   [45] contact in unknown: no (n = 16, err = 6.2%)
#> |   |   |   |   |   |   |   [46] day > 5
#> |   |   |   |   |   |   |   |   [47] education in primary, secondary
#> |   |   |   |   |   |   |   |   |   [48] job in admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, technician, unknown: no (n = 205, err = 6.8%)
#> |   |   |   |   |   |   |   |   |   [49] job in student, unemployed: no (n = 13, err = 46.2%)
#> |   |   |   |   |   |   |   |   [50] education in tertiary, unknown
#> |   |   |   |   |   |   |   |   |   [51] month in aug, jul, may: no (n = 121, err = 13.2%)
#> |   |   |   |   |   |   |   |   |   [52] month in jan, jun, nov: no (n = 58, err = 41.4%)
#> |   |   |   |   |   [53] housing in yes
#> |   |   |   |   |   |   [54] month in aug, jan, jun, nov
#> |   |   |   |   |   |   |   [55] contact in cellular, telephone
#> |   |   |   |   |   |   |   |   [56] month in aug, jan, nov: no (n = 103, err = 14.6%)
#> |   |   |   |   |   |   |   |   [57] month in jun: yes (n = 9, err = 22.2%)
#> |   |   |   |   |   |   |   [58] contact in unknown: no (n = 44, err = 2.3%)
#> |   |   |   |   |   |   [59] month in jul, may
#> |   |   |   |   |   |   |   [60] poutcome in failure
#> |   |   |   |   |   |   |   |   [61] pdays <= 278: no (n = 7, err = 42.9%)
#> |   |   |   |   |   |   |   |   [62] pdays > 278: no (n = 29, err = 6.9%)
#> |   |   |   |   |   |   |   [63] poutcome in other, unknown: no (n = 374, err = 1.6%)
#> |   |   |   [64] poutcome in success
#> |   |   |   |   [65] duration <= 101: no (n = 24, err = 33.3%)
#> |   |   |   |   [66] duration > 101: yes (n = 105, err = 14.3%)
#> |   [67] duration > 206
#> |   |   [68] duration <= 405
#> |   |   |   [69] poutcome in failure, unknown
#> |   |   |   |   [70] month in apr, dec, feb, mar, oct, sep
#> |   |   |   |   |   [71] housing in no
#> |   |   |   |   |   |   [72] day <= 5
#> |   |   |   |   |   |   |   [73] month in apr, dec, mar
#> |   |   |   |   |   |   |   |   [74] balance <= 1348: yes (n = 19, err = 0.0%)
#> |   |   |   |   |   |   |   |   [75] balance > 1348: yes (n = 7, err = 28.6%)
#> |   |   |   |   |   |   |   [76] month in feb, oct, sep: no (n = 62, err = 41.9%)
#> |   |   |   |   |   |   [77] day > 5: yes (n = 286, err = 13.3%)
#> |   |   |   |   |   [78] housing in yes
#> |   |   |   |   |   |   [79] month in apr, feb
#> |   |   |   |   |   |   |   [80] day <= 20
#> |   |   |   |   |   |   |   |   [81] month in apr: no (n = 89, err = 15.7%)
#> |   |   |   |   |   |   |   |   [82] month in feb
#> |   |   |   |   |   |   |   |   |   [83] day <= 9: no (n = 25, err = 20.0%)
#> |   |   |   |   |   |   |   |   |   [84] day > 9: yes (n = 9, err = 0.0%)
#> |   |   |   |   |   |   |   [85] day > 20: yes (n = 24, err = 20.8%)
#> |   |   |   |   |   |   [86] month in dec, mar, oct, sep: yes (n = 43, err = 14.0%)
#> |   |   |   |   [87] month in aug, jan, jul, jun, may, nov
#> |   |   |   |   |   [88] contact in cellular, telephone
#> |   |   |   |   |   |   [89] month in aug, jan, jul, may, nov
#> |   |   |   |   |   |   |   [90] housing in no
#> |   |   |   |   |   |   |   |   [91] day <= 16
#> |   |   |   |   |   |   |   |   |   [92] loan in no
#> |   |   |   |   |   |   |   |   |   |   [93] job in admin., housemaid, management, retired, services, student, unemployed: yes (n = 161, err = 25.5%)
#> |   |   |   |   |   |   |   |   |   |   [94] job in blue-collar, entrepreneur, self-employed, technician, unknown: no (n = 84, err = 40.5%)
#> |   |   |   |   |   |   |   |   |   [95] loan in yes: no (n = 32, err = 18.8%)
#> |   |   |   |   |   |   |   |   [96] day > 16
#> |   |   |   |   |   |   |   |   |   [97] month in aug, jan, jul, nov
#> |   |   |   |   |   |   |   |   |   |   [98] pdays <= -1
#> |   |   |   |   |   |   |   |   |   |   |   [99] balance <= 1148: no (n = 155, err = 13.5%)
#> |   |   |   |   |   |   |   |   |   |   |   [100] balance > 1148: no (n = 61, err = 37.7%)
#> |   |   |   |   |   |   |   |   |   |   [101] pdays > -1: yes (n = 22, err = 31.8%)
#> |   |   |   |   |   |   |   |   |   [102] month in may: yes (n = 33, err = 24.2%)
#> |   |   |   |   |   |   |   [103] housing in yes: no (n = 381, err = 19.7%)
#> |   |   |   |   |   |   [104] month in jun: yes (n = 85, err = 9.4%)
#> |   |   |   |   |   [105] contact in unknown
#> |   |   |   |   |   |   [106] month in aug, jul, nov: no (n = 9, err = 33.3%)
#> |   |   |   |   |   |   [107] month in jun, may
#> |   |   |   |   |   |   |   [108] month in jun: no (n = 124, err = 7.3%)
#> |   |   |   |   |   |   |   [109] month in may
#> |   |   |   |   |   |   |   |   [110] job in admin., blue-collar, management, self-employed, services, student, technician, unemployed: no (n = 258, err = 0.4%)
#> |   |   |   |   |   |   |   |   [111] job in entrepreneur, retired: no (n = 10, err = 10.0%)
#> |   |   |   [112] poutcome in other, success
#> |   |   |   |   [113] poutcome in other
#> |   |   |   |   |   [114] housing in no: yes (n = 82, err = 15.9%)
#> |   |   |   |   |   [115] housing in yes: no (n = 48, err = 45.8%)
#> |   |   |   |   [116] poutcome in success: yes (n = 388, err = 5.7%)
#> |   |   [117] duration > 405
#> |   |   |   [118] duration <= 643
#> |   |   |   |   [119] contact in cellular, telephone
#> |   |   |   |   |   [120] month in apr, aug, dec, jun, mar, oct, sep
#> |   |   |   |   |   |   [121] housing in no: yes (n = 341, err = 10.9%)
#> |   |   |   |   |   |   [122] housing in yes: yes (n = 112, err = 25.9%)
#> |   |   |   |   |   [123] month in feb, jan, jul, may, nov
#> |   |   |   |   |   |   [124] poutcome in failure, other, unknown: yes (n = 538, err = 33.6%)
#> |   |   |   |   |   |   [125] poutcome in success
#> |   |   |   |   |   |   |   [126] marital in divorced, married: yes (n = 42, err = 0.0%)
#> |   |   |   |   |   |   |   [127] marital in single: yes (n = 14, err = 28.6%)
#> |   |   |   |   [128] contact in unknown
#> |   |   |   |   |   [129] duration <= 472: no (n = 61, err = 23.0%)
#> |   |   |   |   |   [130] duration > 472: yes (n = 161, err = 46.6%)
#> |   |   |   [131] duration > 643
#> |   |   |   |   [132] marital in divorced, single: yes (n = 704, err = 9.8%)
#> |   |   |   |   [133] marital in married: yes (n = 852, err = 16.2%)
#> 
#> Number of inner nodes:    66
#> Number of terminal nodes: 67

plot(model_bank_tree, type = "simple")

Dari hasil model kita diatas, dapat kita simpulkan beberapa point bahwa: - jik poutcome (hasil dari marketing campaign sebelumnya) sukses, maka status pernikaham baik itu divorced, married ataupun single, di prediksi akan melaukan Deposito - jika housing (pinjaman perumahan) tidak ada, makam pekerjaan admin, housemaid, management, retired, services, student, unemployed akan di prediksi melakukan Deposito juga

Mari kita lakukan evaluasi model agar mengtahui performansi dari pada model Decision Tree yang telah kita buat.

# model evaluasi
bank_pred_class <- predict(object = model_bank_tree, newdata = bank_test, type = "response")

conf_tree <- confusionMatrix(data = bank_pred_class,
                reference = bank_test$deposit, positive = "yes") 
conf_tree

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  979 156
#>        yes 211 887
#>                                                
#>                Accuracy : 0.8356               
#>                  95% CI : (0.8196, 0.8508)     
#>     No Information Rate : 0.5329               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.6709               
#>                                                
#>  Mcnemar's Test P-Value : 0.004821             
#>                                                
#>             Sensitivity : 0.8504               
#>             Specificity : 0.8227               
#>          Pos Pred Value : 0.8078               
#>          Neg Pred Value : 0.8626               
#>              Prevalence : 0.4671               
#>          Detection Rate : 0.3972               
#>    Detection Prevalence : 0.4917               
#>       Balanced Accuracy : 0.8366               
#>                                                
#>        'Positive' Class : yes                  
#>

Dari hasil model Decision Tree kita di atas, dapat di simpulkan: 1. Recall / Sensitivity => Yang benar di prediksi positif dari yang realita nya (aktual nya) positif oleh Model kita ada 85% 2. Specifity => Yang benar di prediksi negatif dari yang realita nya (aktual nya) negatif oleh Model kita ada 82% 3. Precission Yang benar di prediksi positif dari yang di prediksi nya - positif oleh Model kita ada 80% 4. Accuracy => Seberapa banyak yang benar di prediksi dari keseluruhan data (positif maupun negatif) oleh model kita ada 83%

Sebelum kita melanjutkan pada Tunning Model dengan Threshold, kita akan menguji terlebih dahulu apakah model kita sudah Good Fit atau justru Overfitt. Mari kita uji pada Data Train

# cek pada data train
bank_pred_class_train <- predict(object = model_bank_tree, newdata = bank_train, type = "response")

confusionMatrix(data = bank_pred_class_train,
                reference = bank_train$deposit, positive = "yes")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  3873  480
#>        yes  810 3766
#>                                                
#>                Accuracy : 0.8555               
#>                  95% CI : (0.8481, 0.8628)     
#>     No Information Rate : 0.5245               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.7114               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.8870               
#>             Specificity : 0.8270               
#>          Pos Pred Value : 0.8230               
#>          Neg Pred Value : 0.8897               
#>              Prevalence : 0.4755               
#>          Detection Rate : 0.4218               
#>    Detection Prevalence : 0.5125               
#>       Balanced Accuracy : 0.8570               
#>                                                
#>        'Positive' Class : yes                  
#>

Terlihat hasil Accuracy prediksi kita terhadap Data Test dan Data Train tidak terpaut jauh. Pada Data Test Accuracy kita 83% dan pada Data Train 85%, itu artinya Accuray pada Data Test dan Data Train masih sangat dekat sehingga tidak tergolong Overfit.

Mari kita coba lihat plot ROC dan nilai AUC nya

# roc dan auc
bank_pred_prob <- predict(object = model_bank_tree, newdata = bank_test, type = "prob")

dtree_roc <- data.frame(pred_proba= bank_pred_prob[, "yes"],
                        actual=as.factor(ifelse(bank_test$deposit=="yes", 1, 0)))

roc_pred <- prediction(predictions = dtree_roc$pred_proba, 
                       labels = dtree_roc$actual)

# buat plot dari objek prediction
plot(performance(prediction.obj = roc_pred, measure = "tpr", x.measure = "fpr"))

auc_value <- performance(prediction.obj = roc_pred, measure = "auc")
auc_value@y.values # untuk menampilkan nilai AUC dari Plot ROC

#> [[1]]
#> [1] 0.9034105

Di lihat dari plot ROC nya, untuk False Positive Rate nya sudah mendakati 0, yan dimana itu artinya Kesalahn model kita di dalam memprediksi sudah semakin sedikit / kecil. Kemudian terkait True Positive Rate nya sudah mendekati 1, yang artinya model kita sudah semakin baik didalam memprediksi kelas Positif maupun Negatif. Di lihat dari nilai AUC nya pun sudah mendakati 1 (90%), dengan kata lain performansi model kita sudah semakin baik di dalam mengklasifikasikan kelas Positif maupun Negatif.

Mari kita coba Tunning Model nya dengan Pergeseran Threshold

predict_prob_tree <- data.frame(negative = round(bank_pred_prob[,1],4), positive = round(bank_pred_prob[,2],4))



# apply function metrics
for(i in 1:100){
  result[i,] = metrics(cutoff = co[i], 
                       prob = predict_prob_tree$positive, 
                       ref = as.factor(ifelse(bank_test$deposit == "yes", 1, 0)), 
                       postarget = "1", 
                       negtarget = "0")
}

# visualize
ggplotly(tibble("Recall" = result[,1],
                "Accuracy" = result[,2],
                "Precision" = result[,3],
                "Specificity" = result[,4],
                "Cutoff" = co) %>% 
           gather(key = "Metrics", value = "value", 1:4) %>% 
           ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
           geom_line(lwd = 1.5) +
           scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
           scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
           scale_x_continuous(breaks = seq(0,1,0.1)) +
           labs(title = "Tradeoff Model Perfomance") +
           theme_minimal() +
           theme(legend.position = "top",
                 panel.grid.minor.y = element_blank(),
                 panel.grid.minor.x = element_blank()))

Terlihat pada Cutoff / Threshold 0.6 nilai ke-empat metrix cukup meningkat mari kita coba untuk menggunakan Threshold 0.6

# tuning final model
confusionMatrix(data = as.factor(ifelse(bank_pred_prob[, "yes"] > 0.60, "yes", "no")),
                reference = bank_test$deposit, positive = "yes")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  994 167
#>        yes 196 876
#>                                              
#>                Accuracy : 0.8374             
#>                  95% CI : (0.8215, 0.8525)   
#>     No Information Rate : 0.5329             
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.674              
#>                                              
#>  Mcnemar's Test P-Value : 0.1417             
#>                                              
#>             Sensitivity : 0.8399             
#>             Specificity : 0.8353             
#>          Pos Pred Value : 0.8172             
#>          Neg Pred Value : 0.8562             
#>              Prevalence : 0.4671             
#>          Detection Rate : 0.3923             
#>    Detection Prevalence : 0.4801             
#>       Balanced Accuracy : 0.8376             
#>                                              
#>        'Positive' Class : yes                
#>

Tidak jauh berbeda terkait metrix Accuracy nya, Recall nya pun menurun. Tapi jika di lihat pada Precision nya meningkat 1% dari 80 menjadi 81%. ini kurang bagus karena kita akan memilih metrix Recall pada kasus kali ini dan jika di lihat dari nilai Recall nya jelas turun dari pada model sebelumnya.Tetapi nanti kita lihat hasil di akhir nya setelah kita telah membuat model Random Forest nya.

Random Forest

Random Forest merupakan salah satu model (Ensemble Method) yang memiliki performa sangat baik. Ia juga dapat memilih feature selection secara otomatis dan terdapat juga Out of Bag sample yang digunakan sebagain pengganti Data Test. Jadi kita tek perlu repot-repot untuk membuat data train dan test. Tetapi kekurangan nya ia adalah model black-box yang artinya ia tidak dapat di interpertasikan. Dan Training Cost nya sangat besar (dari hardware maupun waktu).

Mari kita coba lihat dimensi data kita

dim(bank)

#> [1] 11162    17

Kita memiliki kolom yang cukup banyak, alangkah baik nya jika kita memfilter mana variable yang memiliki variance tinggi atau banyak informasi yang terkandung di dalam nya. Sehingga kolom dengan variance yang rendah (mendekati 0) akan kita take out, karena tidak mengandung banyak informasi yang berguna di dalam nya. Kita akan menggunakan fungsi nearZeroVar.

library(caret)
no_var <- nearZeroVar(bank)
bank_clean <- bank[, -no_var]
dim(bank_clean)

#> [1] 11162    15

Terlihat sekarang kita sudah memiliki 15 kolom dari 17 kolom sebelumnya

Mari kita lakukan Cross Validation, sebenarnya tidak wajib pada random forest melakukan splitting data test dan train, tapi tidak apa-apa juga kalau kita ingin melakukan nya.

set.seed(100)
index_two <- sample(x = nrow(bank_clean), size= nrow(bank_clean)*0.8)
bank_train_rf <- bank_clean[index_two, ]
bank_test_rf <- bank_clean[-index_two,]

Kemudian kita disini kita akan menggunakan k-fold validation untuk melakukan cross validation pada model random forest kita. Jadi apa k-fold validation itu ? K-fold validation merupakan teknik Cross Validation yangakan membagi datanya sebanyak k bagian, setiap bagian akan digunakan menjadi data test secara bergantian. Dan pada akhirnya akan di ambil Fold dengan Mtry (prediktor) yang memiliki Accuracy paling tinggi.

Kita disini akan menggunakan 5 K-fold dan di repeat sebanyak 3 kali dengan percobaan Mtry (prediktor) yang berbeda.

# kita akan menggunakan k-fold validation untuk melakukan cross validation pada model random forest kita
set.seed(417)
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# pembuatan model random forest, disini model nya tidak akan saya running karena sudah saya training sebelumya dan telah saya simpan pada Objeck RDS 

# model_bank_rf <- train(deposit~., data = bank_train_rf, method = "rf",
#                        trainControl = control)

Mari kita panggil model Random Forest yang sebelunya telah saya latih pada Objek RDS nya.

bank_forest <- readRDS("bank_forest.RDS")

Saya disini menggunakan gambar dari pada Output finalModel dari random forest nya di karenakan akan mengeluarkan Output yang sangat panjang jika saya tulis Manual (bank_forest$finalModel) nya

Dapat kita lihat diatas. hasil akhir model random forest kita menggunakan mtry (variable prediktor) sebanyak 21 dan Error dari Out of Bag (data testing) sebesar 14.46% yang artinya Accuracy kita pada data test sebesar (100 - 14.46) = 85%

Mari kita coba untuk mengevaluasi model random forest nya menggunakan Confusion Matrix

bank_forest_pred <- predict(object = bank_forest, newdata = bank_test_rf)

conf_forest <- confusionMatrix(data = bank_forest_pred, reference = bank_test_rf$deposit, positive = "yes")
conf_forest

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  989 123
#>        yes 201 920
#>                                                
#>                Accuracy : 0.8549               
#>                  95% CI : (0.8396, 0.8693)     
#>     No Information Rate : 0.5329               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.7099               
#>                                                
#>  Mcnemar's Test P-Value : 0.00001888           
#>                                                
#>             Sensitivity : 0.8821               
#>             Specificity : 0.8311               
#>          Pos Pred Value : 0.8207               
#>          Neg Pred Value : 0.8894               
#>              Prevalence : 0.4671               
#>          Detection Rate : 0.4120               
#>    Detection Prevalence : 0.5020               
#>       Balanced Accuracy : 0.8566               
#>                                                
#>        'Positive' Class : yes                  
#>

Dari hasil model Decision Tree kita di atas, dapat di simpulkan: - Recall / Sensitivity => Yang benar di prediksi positif dari yang realita nya (aktual nya) positif oleh Model kita ada 88% - Specifity => Yang benar di prediksi negatif dari yang realita nya (aktual nya) negatif oleh Model kita ada 83% - Precission Yang benar di prediksi positif dari yang di prediksi nya - positif oleh Model kita ada 82% - Accuracy => Seberapa banyak yang benar di prediksi dari keseluruhan data (positif maupun negatif) oleh model kita ada 85%

Mari kita coba lihat Plot ROC dan nilai AUC nya

bank_forest_pred_prob <- predict(object = bank_forest, newdata = bank_test_rf, type="prob")

forest_roc <- data.frame(pred_proba= bank_forest_pred_prob[, "yes"],
                        actual=as.factor(ifelse(bank_test_rf$deposit=="yes", 1, 0)))

roc_pred <- prediction(predictions = forest_roc$pred_proba, 
                       labels = forest_roc$actual)

# buat plot dari objek prediction
plot(performance(prediction.obj = roc_pred, measure = "tpr", x.measure = "fpr"))

auc_value <- performance(prediction.obj = roc_pred, measure = "auc")
auc_value@y.values # untuk menampilkan nilai AUC dari Plot ROC

#> [[1]]
#> [1] 0.9219277

Di lihat dari plot ROC nya, untuk False Positive Rate nya sudah mendakati 0, yan dimana itu artinya Kesalahn model kita di dalam memprediksi sudah semakin sedikit / kecil. Kemudian terkait True Positive Rate nya sudah mendekati 1, yang artinya model kita sudah semakin baik didalam memprediksi kelas Positif maupun Negatif. Di lihat dari nilai AUC nya pun sudah mendakati 1 (92%), dengan kata lain performansi model kita sudah semakin baik di dalam mengklasifikasikan kelas Positif maupun Negatif.

Mari kita lihat Importance Variable yang terdepat pada model kita. Importance Variable ini dapat di artikan sebagin variable yang dapat mengurangi Error pada model kita melakukan Splitting (pemisahan pada pohon) dan meningkatkan Accuracy pada model kita. Tapi perlu di ingat tidak selamanya Variable Importance ini benar, terkadang bisa juga misleading, jadi di sesuaikan saja dengan kebutuhan bisnis nya, jika varialbe tersebut masuk ke dalam kategori variable yang importance tapi pada kenyataan nya tidak, maka bisa di take out saja. Misalnya seperti ID, sudah jelas ID tidak mungkin kita ikut sertakan di dalam model kita, tapi kalau misalkan di kategori kan kedalam variable importance maka bisa di take out.

varImp(bank_forest) %>% plot()

Duration merupakan variable yang paling berpengaruh terhadap model Random Forest kita.

Mari coba lihat posisi threshold terbaik pada model kita

predict_prob_forest <- data.frame(negative = round(bank_forest_pred_prob[,1],4), positive = round(bank_forest_pred_prob[,2],4))

# apply function metrics
for(i in 1:100){
  result[i,] = metrics(cutoff = co[i], 
                       prob = predict_prob_forest$positive, 
                       ref = as.factor(ifelse(bank_test_rf$deposit == "yes", 1, 0)), 
                       postarget = "1", 
                       negtarget = "0")
}

# visualize
ggplotly(tibble("Recall" = result[,1],
                "Accuracy" = result[,2],
                "Precision" = result[,3],
                "Specificity" = result[,4],
                "Cutoff" = co) %>% 
           gather(key = "Metrics", value = "value", 1:4) %>% 
           ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
           geom_line(lwd = 1.5) +
           scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
           scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
           scale_x_continuous(breaks = seq(0,1,0.1)) +
           labs(title = "Tradeoff Model Perfomance") +
           theme_minimal() +
           theme(legend.position = "top",
                 panel.grid.minor.y = element_blank(),
                 panel.grid.minor.x = element_blank()))

Seperti nya 0.5 sudah cukup baik untuk model kita, jadi tidak perlu di ubah lagi :)

Conclusion

naive <- data_frame(Accuracy = conf_naive$overall[1],
           Recall = conf_naive$byClass[1],
           Specificity = conf_naive$byClass[2],
           Precision = conf_naive$byClass[3])

dtree <- data_frame(Accuracy = conf_tree$overall[1],
           Recall = conf_tree$byClass[1],
           Specificity = conf_tree$byClass[2],
           Precision = conf_tree$byClass[3])

rforest <- data_frame(Accuracy = conf_forest$overall[1],
           Recall = conf_forest$byClass[1],
           Specificity = conf_forest$byClass[2],
           Precision = conf_forest$byClass[3])

rbind("Naive Bayes" = naive, "Decision Tree" = dtree, "Random Forest" = rforest)

#> # A tibble: 3 × 4
#>   Accuracy Recall Specificity Precision
#> *    <dbl>  <dbl>       <dbl>     <dbl>
#> 1    0.774  0.735       0.808     0.771
#> 2    0.836  0.850       0.823     0.808
#> 3    0.855  0.882       0.831     0.821

Berdasarkan hasil ke-empat metrix diatas. Saya akan memilih Random Forest karena dari segi Accuracy dan Recall yang lebih tinggi dari Decisio Tree dan Naive Bayes. Di samping itu juga karena pada kasus ini saya ingin menggunakan Metrix Recall guna mengurangin kelas False Negatif yang artinya saya ingin melakukan prediksi sebanyak-banyak nya terkait nasabah yang ingin melakukan Deposito dengan kata lain saya ingin mengurangi Nasabah yang melakukan Deposito tetapi di prediksi tidak melakuakan Deposito. Karena dengan informasi tersebut saya bisa mengekstrak kriteria dari nasabah yang melakukan Deposito lalu dapat menerapkan strategi campaign yang mampu menarik ketertarikan dari pada para nasabah di bank tempat saya bekerja.