Bank Classification With Some Algorithm

Intro

Datasets yang saya gunakan adalah dana mengenai Marketing dari bank di Portugal, yaitu mengenai marketing produk deposito berjangka. di dalam data ini terdapat 21 variable prediktor dan 1 variable output, dan dalam 21 variable prediktor tersebut terbagi lagi menjadi tiga bagian, yaitu prediktor yang mengenai data pribadi calon nasabah, lalu ada prediktor yang berkenaan dengan cara bank marketing ke calon nasabah tersebut, dan yang terakhir yaitu prediktor yang berhubungan kondisi perekonomian saat itu. kelebihan dari dataset ini yaitu memiliki variable yang mumpuni untuk memprediksi apakah marketing yang dilakukan bank akan berhasil, serta memiliki sampel yang banyak untuk dijadikan alat dalam membuat model yang baik. namun kelemahannya yaitu dalam variable yang berhubungan dengan kondisi ekonomi tidak disertakan data mengenai suku bunga pada periode terkait, sehingga saya rasa kurang pas apabila data mengenai marketing produk deposito berjangka namun tidak memasukan elemen suku bunga.

Import Library

library(tidyverse)
library(lubridate)
library(caret)
library(e1071)
library(partykit)
library(randomForest)
library(car)
library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.0.4

Read Data

bank <- read.csv(file = "bank.csv", head = TRUE, sep=";", row.names = NULL)
head(bank)

##   age       job marital   education default housing loan   contact month
## 1  56 housemaid married    basic.4y      no      no   no telephone   may
## 2  57  services married high.school unknown      no   no telephone   may
## 3  37  services married high.school      no     yes   no telephone   may
## 4  40    admin. married    basic.6y      no      no   no telephone   may
## 5  56  services married high.school      no      no  yes telephone   may
## 6  45  services married    basic.9y unknown      no   no telephone   may
##   day_of_week duration campaign pdays previous    poutcome emp.var.rate
## 1         mon      261        1   999        0 nonexistent          1.1
## 2         mon      149        1   999        0 nonexistent          1.1
## 3         mon      226        1   999        0 nonexistent          1.1
## 4         mon      151        1   999        0 nonexistent          1.1
## 5         mon      307        1   999        0 nonexistent          1.1
## 6         mon      198        1   999        0 nonexistent          1.1
##   cons.price.idx cons.conf.idx euribor3m nr.employed  y
## 1         93.994         -36.4     4.857        5191 no
## 2         93.994         -36.4     4.857        5191 no
## 3         93.994         -36.4     4.857        5191 no
## 4         93.994         -36.4     4.857        5191 no
## 5         93.994         -36.4     4.857        5191 no
## 6         93.994         -36.4     4.857        5191 no

str(bank)

## 'data.frame':    41188 obs. of  21 variables:
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : chr  "housemaid" "services" "services" "admin." ...
##  $ marital       : chr  "married" "married" "married" "married" ...
##  $ education     : chr  "basic.4y" "high.school" "high.school" "basic.6y" ...
##  $ default       : chr  "no" "unknown" "no" "no" ...
##  $ housing       : chr  "no" "no" "yes" "no" ...
##  $ loan          : chr  "no" "no" "no" "no" ...
##  $ contact       : chr  "telephone" "telephone" "telephone" "telephone" ...
##  $ month         : chr  "may" "may" "may" "may" ...
##  $ day_of_week   : chr  "mon" "mon" "mon" "mon" ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ y             : chr  "no" "no" "no" "no" ...

Age : Usia Nasabah Nasabah
Job : Tipe Pekerjaan Calon Nasabah
Marital : Status Pernikahan calon nasabah
Education : Status Pendidikan Calon nasabah
Default: Apakah Memiliki Kredit Pinjaman
Housing: Apakah Memiliki Kredit Rumah
Loan: Apakah memiliki utang pribadi
Contact: Alat Menghubungi Calon Nasabah
Month: Bulan menghubungi calon nasabah
Dayofweek: Hari menghubungi calon nasabah
Duration: Durasi Terakhir ketika menghubungi Calon Nasabah
Campaign: Berapa kali menghubungi calon nasabah
Pdays: berapa hari calon nasabah menjadi nasabah setelah kontak terakhir, 999 brarti gagal
Previous: berapa kali calon nasabah dihubungi dalam marketing sebelumnya
Poutcome: hasil ketika marketing sebelumnya
Emp.var.rate: employment variation rate - quarterly indicator
Cons.price.idx: consumer price index - monthly indicator
Cons.conf.idx: consumer confidence index - monthly indicator
Euribor3m: euribor 3 month rate - daily indicator
Nr.employed: number of employees - quarterly indicator
y - Apakah calon nasabah menjadi nasabah ‘yes’, ‘no’

colSums(is.na(bank))

##            age            job        marital      education        default 
##              0              0              0              0              0 
##        housing           loan        contact          month    day_of_week 
##              0              0              0              0              0 
##       duration       campaign          pdays       previous       poutcome 
##              0              0              0              0              0 
##   emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
##              0              0              0              0              0 
##              y 
##              0

Data Wrangling and EDA

bank_clean <- bank %>% 
  mutate_if(~is.character(.), ~as.factor(.)) %>% 
  filter(duration != 0)

prop.table(table(bank_clean$y))

## 
##        no       yes 
## 0.8873349 0.1126651

Upsampling

bank_upsampling <- downSample(x = bank_clean %>% select(-y),
                              y = bank_clean$y,
                              yname = "y")
prop.table(table(bank_upsampling$y))

## 
##  no yes 
## 0.5 0.5

Modelling

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(123)

index <- sample(nrow(bank_upsampling), nrow(bank_upsampling)*0.75)
train <- bank_upsampling[index, ]
test <- bank_upsampling[-index, ]
unique(train$default)

## [1] unknown no     
## Levels: no unknown yes

unique(test$default)

## [1] no      unknown
## Levels: no unknown yes

model1 <- glm(y~., data = train, family = "binomial")
summary(model1)

## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.0068  -0.3582  -0.0844   0.4629   2.9879  
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -2.280e+02  6.836e+01  -3.335 0.000853 ***
## age                          -4.490e-03  4.626e-03  -0.971 0.331773    
## jobblue-collar               -1.919e-01  1.516e-01  -1.265 0.205695    
## jobentrepreneur              -1.080e-02  2.329e-01  -0.046 0.963031    
## jobhousemaid                  1.274e-01  2.685e-01   0.474 0.635150    
## jobmanagement                -2.114e-01  1.609e-01  -1.314 0.188801    
## jobretired                    8.060e-01  2.169e-01   3.716 0.000202 ***
## jobself-employed             -2.806e-01  2.258e-01  -1.243 0.213907    
## jobservices                  -2.208e-01  1.636e-01  -1.350 0.177139    
## jobstudent                    3.432e-01  2.342e-01   1.466 0.142769    
## jobtechnician                -9.875e-02  1.360e-01  -0.726 0.467803    
## jobunemployed                -5.693e-02  2.455e-01  -0.232 0.816656    
## jobunknown                    1.063e-01  4.528e-01   0.235 0.814435    
## maritalmarried                9.525e-02  1.279e-01   0.745 0.456442    
## maritalsingle                 2.657e-01  1.466e-01   1.813 0.069905 .  
## maritalunknown                1.381e+00  9.537e-01   1.448 0.147557    
## educationbasic.6y            -9.957e-02  2.283e-01  -0.436 0.662703    
## educationbasic.9y            -9.869e-03  1.790e-01  -0.055 0.956028    
## educationhigh.school          7.153e-02  1.769e-01   0.404 0.685976    
## educationilliterate           1.265e+01  2.264e+02   0.056 0.955452    
## educationprofessional.course  1.056e-01  1.930e-01   0.547 0.584236    
## educationuniversity.degree    2.781e-01  1.778e-01   1.564 0.117870    
## educationunknown              9.598e-02  2.268e-01   0.423 0.672207    
## defaultunknown               -2.254e-01  1.239e-01  -1.819 0.068920 .  
## housingunknown               -1.307e-01  2.703e-01  -0.483 0.628789    
## housingyes                   -1.009e-02  7.943e-02  -0.127 0.898945    
## loanunknown                          NA         NA      NA       NA    
## loanyes                       8.192e-03  1.104e-01   0.074 0.940836    
## contacttelephone             -2.620e-01  1.444e-01  -1.815 0.069597 .  
## monthaug                      1.072e+00  2.505e-01   4.281 1.86e-05 ***
## monthdec                     -2.376e-01  4.360e-01  -0.545 0.585795    
## monthjul                     -3.749e-02  1.850e-01  -0.203 0.839389    
## monthjun                     -9.485e-01  2.312e-01  -4.102 4.09e-05 ***
## monthmar                      2.268e+00  3.040e-01   7.462 8.54e-14 ***
## monthmay                     -9.892e-01  1.553e-01  -6.370 1.89e-10 ***
## monthnov                     -7.511e-01  2.295e-01  -3.273 0.001065 ** 
## monthoct                      4.904e-01  3.059e-01   1.603 0.108915    
## monthsep                      1.087e-01  3.378e-01   0.322 0.747485    
## day_of_weekmon                2.085e-02  1.242e-01   0.168 0.866657    
## day_of_weekthu               -4.259e-02  1.231e-01  -0.346 0.729251    
## day_of_weektue                1.387e-01  1.265e-01   1.097 0.272839    
## day_of_weekwed                2.115e-01  1.248e-01   1.695 0.090046 .  
## duration                      7.660e-03  2.110e-04  36.301  < 2e-16 ***
## campaign                     -4.624e-02  2.176e-02  -2.125 0.033575 *  
## pdays                        -5.247e-04  5.042e-04  -1.041 0.297986    
## previous                      9.629e-02  1.513e-01   0.636 0.524638    
## poutcomenonexistent           6.452e-01  2.081e-01   3.101 0.001932 ** 
## poutcomesuccess               1.423e+00  4.976e-01   2.859 0.004246 ** 
## emp.var.rate                 -2.201e+00  2.563e-01  -8.587  < 2e-16 ***
## cons.price.idx                2.224e+00  4.528e-01   4.912 9.03e-07 ***
## cons.conf.idx                -4.415e-03  1.618e-02  -0.273 0.785021    
## euribor3m                     6.483e-01  2.515e-01   2.578 0.009948 ** 
## nr.employed                   2.783e-03  5.642e-03   0.493 0.621857    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9648.6  on 6959  degrees of freedom
## Residual deviance: 4357.6  on 6908  degrees of freedom
## AIC: 4461.6
## 
## Number of Fisher Scoring iterations: 12

backward <- step(model1, direction = "backward")

## Start:  AIC=4461.59
## y ~ age + job + marital + education + default + housing + loan + 
##     contact + month + day_of_week + duration + campaign + pdays + 
##     previous + poutcome + emp.var.rate + cons.price.idx + cons.conf.idx + 
##     euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - education       7   4366.1 4456.1
## - day_of_week     4   4363.5 4459.5
## - loan            1   4357.6 4459.6
## - housing         1   4357.6 4459.6
## - cons.conf.idx   1   4357.7 4459.7
## - nr.employed     1   4357.8 4459.8
## - previous        1   4358.0 4460.0
## - age             1   4358.5 4460.5
## - pdays           1   4358.7 4460.7
## - marital         3   4363.6 4461.6
## <none>                4357.6 4461.6
## - contact         1   4360.9 4462.9
## - default         1   4360.9 4462.9
## - campaign        1   4362.3 4464.3
## - euribor3m       1   4364.3 4466.3
## - job            11   4389.2 4471.2
## - poutcome        2   4372.3 4472.3
## - cons.price.idx  1   4381.7 4483.7
## - emp.var.rate    1   4431.0 4533.0
## - month           9   4696.0 4782.0
## - duration        1   7486.2 7588.2
## 
## Step:  AIC=4456.08
## y ~ age + job + marital + default + housing + loan + contact + 
##     month + day_of_week + duration + campaign + pdays + previous + 
##     poutcome + emp.var.rate + cons.price.idx + cons.conf.idx + 
##     euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - day_of_week     4   4371.7 4453.7
## - loan            1   4366.1 4454.1
## - housing         1   4366.1 4454.1
## - cons.conf.idx   1   4366.2 4454.2
## - nr.employed     1   4366.3 4454.3
## - previous        1   4366.6 4454.6
## - pdays           1   4367.1 4455.1
## - age             1   4367.5 4455.5
## <none>                4366.1 4456.1
## - marital         3   4372.7 4456.7
## - contact         1   4369.3 4457.3
## - default         1   4370.0 4458.0
## - campaign        1   4370.4 4458.4
## - euribor3m       1   4372.8 4460.8
## - poutcome        2   4381.8 4467.8
## - job            11   4400.6 4468.6
## - cons.price.idx  1   4389.8 4477.8
## - emp.var.rate    1   4438.6 4526.6
## - month           9   4713.4 4785.4
## - duration        1   7495.0 7583.0
## 
## Step:  AIC=4453.68
## y ~ age + job + marital + default + housing + loan + contact + 
##     month + duration + campaign + pdays + previous + poutcome + 
##     emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m + 
##     nr.employed
## 
##                  Df Deviance    AIC
## - loan            1   4371.7 4451.7
## - housing         1   4371.7 4451.7
## - nr.employed     1   4371.8 4451.8
## - cons.conf.idx   1   4371.8 4451.8
## - previous        1   4372.3 4452.3
## - pdays           1   4372.6 4452.6
## - age             1   4373.2 4453.2
## <none>                4371.7 4453.7
## - marital         3   4378.4 4454.4
## - contact         1   4374.6 4454.6
## - default         1   4375.6 4455.6
## - campaign        1   4376.4 4456.4
## - euribor3m       1   4379.2 4459.2
## - poutcome        2   4387.8 4465.8
## - job            11   4407.0 4467.0
## - cons.price.idx  1   4394.1 4474.1
## - emp.var.rate    1   4443.1 4523.1
## - month           9   4718.0 4782.0
## - duration        1   7505.6 7585.6
## 
## Step:  AIC=4451.68
## y ~ age + job + marital + default + housing + contact + month + 
##     duration + campaign + pdays + previous + poutcome + emp.var.rate + 
##     cons.price.idx + cons.conf.idx + euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - housing         2   4371.9 4447.9
## - nr.employed     1   4371.8 4449.8
## - cons.conf.idx   1   4371.8 4449.8
## - previous        1   4372.3 4450.3
## - pdays           1   4372.7 4450.7
## - age             1   4373.2 4451.2
## <none>                4371.7 4451.7
## - marital         3   4378.4 4452.4
## - contact         1   4374.6 4452.6
## - default         1   4375.7 4453.7
## - campaign        1   4376.4 4454.4
## - euribor3m       1   4379.2 4457.2
## - poutcome        2   4387.8 4463.8
## - job            11   4407.0 4465.0
## - cons.price.idx  1   4394.1 4472.1
## - emp.var.rate    1   4443.2 4521.2
## - month           9   4718.3 4780.3
## - duration        1   7506.0 7584.0
## 
## Step:  AIC=4447.92
## y ~ age + job + marital + default + contact + month + duration + 
##     campaign + pdays + previous + poutcome + emp.var.rate + cons.price.idx + 
##     cons.conf.idx + euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - nr.employed     1   4372.0 4446.0
## - cons.conf.idx   1   4372.1 4446.1
## - previous        1   4372.5 4446.5
## - pdays           1   4372.9 4446.9
## - age             1   4373.4 4447.4
## <none>                4371.9 4447.9
## - marital         3   4378.6 4448.6
## - contact         1   4374.8 4448.8
## - default         1   4375.9 4449.9
## - campaign        1   4376.6 4450.6
## - euribor3m       1   4379.4 4453.4
## - poutcome        2   4388.1 4460.1
## - job            11   4407.2 4461.2
## - cons.price.idx  1   4394.3 4468.3
## - emp.var.rate    1   4443.4 4517.4
## - month           9   4719.0 4777.0
## - duration        1   7510.6 7584.6
## 
## Step:  AIC=4446.01
## y ~ age + job + marital + default + contact + month + duration + 
##     campaign + pdays + previous + poutcome + emp.var.rate + cons.price.idx + 
##     cons.conf.idx + euribor3m
## 
##                  Df Deviance    AIC
## - previous        1   4372.6 4444.6
## - cons.conf.idx   1   4372.7 4444.7
## - pdays           1   4373.0 4445.0
## - age             1   4373.5 4445.5
## <none>                4372.0 4446.0
## - marital         3   4378.7 4446.7
## - contact         1   4374.8 4446.8
## - default         1   4376.0 4448.0
## - campaign        1   4376.7 4448.7
## - poutcome        2   4388.2 4458.2
## - job            11   4407.3 4459.3
## - euribor3m       1   4393.4 4465.4
## - cons.price.idx  1   4461.8 4533.8
## - emp.var.rate    1   4469.3 4541.3
## - month           9   4732.0 4788.0
## - duration        1   7512.4 7584.4
## 
## Step:  AIC=4444.61
## y ~ age + job + marital + default + contact + month + duration + 
##     campaign + pdays + poutcome + emp.var.rate + cons.price.idx + 
##     cons.conf.idx + euribor3m
## 
##                  Df Deviance    AIC
## - cons.conf.idx   1   4373.3 4443.3
## - age             1   4374.1 4444.1
## <none>                4372.6 4444.6
## - pdays           1   4374.7 4444.7
## - marital         3   4379.3 4445.3
## - contact         1   4375.6 4445.6
## - default         1   4376.6 4446.6
## - campaign        1   4377.3 4447.3
## - job            11   4408.2 4458.2
## - euribor3m       1   4393.9 4463.9
## - poutcome        2   4399.9 4467.9
## - cons.price.idx  1   4465.4 4535.4
## - emp.var.rate    1   4470.3 4540.3
## - month           9   4733.1 4787.1
## - duration        1   7512.6 7582.6
## 
## Step:  AIC=4443.28
## y ~ age + job + marital + default + contact + month + duration + 
##     campaign + pdays + poutcome + emp.var.rate + cons.price.idx + 
##     euribor3m
## 
##                  Df Deviance    AIC
## - age             1   4374.9 4442.9
## <none>                4373.3 4443.3
## - pdays           1   4375.3 4443.3
## - marital         3   4379.8 4443.8
## - contact         1   4377.1 4445.1
## - default         1   4377.4 4445.4
## - campaign        1   4378.0 4446.0
## - job            11   4408.6 4456.6
## - euribor3m       1   4395.2 4463.2
## - poutcome        2   4401.0 4467.0
## - cons.price.idx  1   4470.0 4538.0
## - emp.var.rate    1   4485.5 4553.5
## - month           9   4766.8 4818.8
## - duration        1   7513.4 7581.4
## 
## Step:  AIC=4442.9
## y ~ job + marital + default + contact + month + duration + campaign + 
##     pdays + poutcome + emp.var.rate + cons.price.idx + euribor3m
## 
##                  Df Deviance    AIC
## <none>                4374.9 4442.9
## - pdays           1   4377.0 4443.0
## - contact         1   4378.7 4444.7
## - campaign        1   4379.8 4445.8
## - default         1   4379.9 4445.9
## - marital         3   4385.1 4447.1
## - job            11   4410.2 4456.2
## - euribor3m       1   4396.2 4462.2
## - poutcome        2   4402.9 4466.9
## - cons.price.idx  1   4470.2 4536.2
## - emp.var.rate    1   4485.8 4551.8
## - month           9   4766.8 4816.8
## - duration        1   7513.6 7579.6

Evaluation

pred1 <- predict(model1, newdata = test, type = "response")

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

pred1 <- as.factor(ifelse(pred1 >0.5, "yes", "no"))
confusionMatrix(pred1, reference = test$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  1008  111
##        yes  147 1054
##                                           
##                Accuracy : 0.8888          
##                  95% CI : (0.8753, 0.9013)
##     No Information Rate : 0.5022          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.7776          
##                                           
##  Mcnemar's Test P-Value : 0.02933         
##                                           
##             Sensitivity : 0.9047          
##             Specificity : 0.8727          
##          Pos Pred Value : 0.8776          
##          Neg Pred Value : 0.9008          
##              Prevalence : 0.5022          
##          Detection Rate : 0.4543          
##    Detection Prevalence : 0.5177          
##       Balanced Accuracy : 0.8887          
##                                           
##        'Positive' Class : yes             
##

Create New Data and New Model

Hanya Ambil Variable yang memiliki korelasi tinggi berdasarkan summary model 1

data_new <- bank_upsampling
data_new <- data_new %>%
  select(c(y, age, default, contact, month, duration, poutcome, emp.var.rate, cons.price.idx, euribor3m))

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(123)

input <- sample(nrow(data_new), nrow(data_new)*0.75)
trainn <- data_new[input, ]
testt <- data_new[-input, ]

model2 <- glm(y~., data = trainn, family = "binomial")
summary(model2)

## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = trainn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.0326  -0.3621  -0.1261   0.4692   2.9102  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -2.027e+02  1.973e+01 -10.272  < 2e-16 ***
## age                 -1.157e-03  3.144e-03  -0.368  0.71283    
## defaultunknown      -3.272e-01  1.196e-01  -2.736  0.00621 ** 
## contacttelephone    -2.898e-01  1.342e-01  -2.158  0.03089 *  
## monthaug             1.048e+00  1.739e-01   6.028 1.66e-09 ***
## monthdec            -3.826e-01  4.019e-01  -0.952  0.34102    
## monthjul            -8.161e-02  1.746e-01  -0.467  0.64019    
## monthjun            -9.398e-01  1.956e-01  -4.804 1.55e-06 ***
## monthmar             2.220e+00  2.589e-01   8.576  < 2e-16 ***
## monthmay            -1.100e+00  1.382e-01  -7.964 1.67e-15 ***
## monthnov            -7.830e-01  1.981e-01  -3.954 7.70e-05 ***
## monthoct             4.702e-01  2.728e-01   1.724  0.08476 .  
## monthsep            -4.681e-02  2.362e-01  -0.198  0.84290    
## duration             7.558e-03  2.070e-04  36.510  < 2e-16 ***
## poutcomenonexistent  5.356e-01  1.181e-01   4.535 5.77e-06 ***
## poutcomesuccess      1.971e+00  1.974e-01   9.985  < 2e-16 ***
## emp.var.rate        -2.138e+00  1.952e-01 -10.957  < 2e-16 ***
## cons.price.idx       2.104e+00  2.069e-01  10.171  < 2e-16 ***
## euribor3m            7.028e-01  1.448e-01   4.853 1.22e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9648.6  on 6959  degrees of freedom
## Residual deviance: 4427.9  on 6941  degrees of freedom
## AIC: 4465.9
## 
## Number of Fisher Scoring iterations: 6

pred2 <- predict(model2, newdata = testt, type = "response")
  pred2 <- as.factor(ifelse(pred2 >0.5, "yes", "no"))
confusionMatrix(pred2, reference = testt$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  1000  111
##        yes  155 1054
##                                          
##                Accuracy : 0.8853         
##                  95% CI : (0.8717, 0.898)
##     No Information Rate : 0.5022         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7706         
##                                          
##  Mcnemar's Test P-Value : 0.008377       
##                                          
##             Sensitivity : 0.9047         
##             Specificity : 0.8658         
##          Pos Pred Value : 0.8718         
##          Neg Pred Value : 0.9001         
##              Prevalence : 0.5022         
##          Detection Rate : 0.4543         
##    Detection Prevalence : 0.5211         
##       Balanced Accuracy : 0.8853         
##                                          
##        'Positive' Class : yes            
##

Try Another Algorithm

Decision Tree

modeldtree <- ctree(y~., data = trainn)
preddtree <- predict(modeldtree, newdata = testt)
confusionMatrix(preddtree, reference = testt$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no   998  107
##        yes  157 1058
##                                           
##                Accuracy : 0.8862          
##                  95% CI : (0.8726, 0.8989)
##     No Information Rate : 0.5022          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7724          
##                                           
##  Mcnemar's Test P-Value : 0.002564        
##                                           
##             Sensitivity : 0.9082          
##             Specificity : 0.8641          
##          Pos Pred Value : 0.8708          
##          Neg Pred Value : 0.9032          
##              Prevalence : 0.5022          
##          Detection Rate : 0.4560          
##    Detection Prevalence : 0.5237          
##       Balanced Accuracy : 0.8861          
##                                           
##        'Positive' Class : yes             
##

modeldtreee <- ctree(y~., data = trainn, 
                     control = ctree_control(mincriterion = 0.95,
                                             minsplit = 5,
                                             minbucket = 5))
preddtreee <- predict(modeldtreee, newdata = testt)
confusionMatrix(preddtreee, reference = testt$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no   998  103
##        yes  157 1062
##                                           
##                Accuracy : 0.8879          
##                  95% CI : (0.8744, 0.9005)
##     No Information Rate : 0.5022          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7758          
##                                           
##  Mcnemar's Test P-Value : 0.001013        
##                                           
##             Sensitivity : 0.9116          
##             Specificity : 0.8641          
##          Pos Pred Value : 0.8712          
##          Neg Pred Value : 0.9064          
##              Prevalence : 0.5022          
##          Detection Rate : 0.4578          
##    Detection Prevalence : 0.5254          
##       Balanced Accuracy : 0.8878          
##                                           
##        'Positive' Class : yes             
##

Random Forest

modelrf <- randomForest(y~., data = trainn, importance = TRUE, ntree = 500)
predrf <- predict(modelrf, newdata = testt)
confusionMatrix(predrf, reference = testt$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no   993   70
##        yes  162 1095
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.8871, 0.9119)
##     No Information Rate : 0.5022          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7999          
##                                           
##  Mcnemar's Test P-Value : 2.309e-09       
##                                           
##             Sensitivity : 0.9399          
##             Specificity : 0.8597          
##          Pos Pred Value : 0.8711          
##          Neg Pred Value : 0.9341          
##              Prevalence : 0.5022          
##          Detection Rate : 0.4720          
##    Detection Prevalence : 0.5418          
##       Balanced Accuracy : 0.8998          
##                                           
##        'Positive' Class : yes             
##

#set.seed(833)

#ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 5)

#bank_forest <- train(y~., data = train, method = "rf", trControl = ctrl)

#saveRDS(bank_forest, "bank_forest.RDS") # simpan model

model4 <- readRDS("bank_forest.RDS")
pred4 <- predict(model4, newdata = test)
confusionMatrix(pred4, reference = test$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  1000   16
##        yes  155 1149
##                                           
##                Accuracy : 0.9263          
##                  95% CI : (0.9149, 0.9366)
##     No Information Rate : 0.5022          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8525          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9863          
##             Specificity : 0.8658          
##          Pos Pred Value : 0.8811          
##          Neg Pred Value : 0.9843          
##              Prevalence : 0.5022          
##          Detection Rate : 0.4953          
##    Detection Prevalence : 0.5621          
##       Balanced Accuracy : 0.9260          
##                                           
##        'Positive' Class : yes             
##

model4$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 11.09%
## Confusion matrix:
##       no  yes class.error
## no  2921  547  0.15772780
## yes  225 3267  0.06443299

varImp(model4)

## rf variable importance
## 
##   only 20 most important variables shown (out of 53)
## 
##                            Overall
## duration                   100.000
## nr.employed                 28.400
## euribor3m                   25.345
## age                         11.831
## emp.var.rate                 8.480
## cons.conf.idx                8.262
## campaign                     4.975
## cons.price.idx               4.204
## pdays                        2.801
## monthoct                     2.469
## monthmay                     2.237
## housingyes                   1.880
## educationuniversity.degree   1.673
## day_of_weekthu               1.628
## day_of_weekwed               1.596
## day_of_weekmon               1.585
## educationhigh.school         1.568
## defaultunknown               1.433
## jobblue-collar               1.423
## previous                     1.422

New Data Using Top 10 Predictors from Random Forrest

bank_final <- bank_upsampling
bank_final <- bank_final %>% 
  select(duration, nr.employed, euribor3m, age, emp.var.rate, cons.conf.idx, cons.price.idx, campaign, pdays, month, y)

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(123)

insta <- sample(nrow(bank_final), nrow(bank_final)*0.75)
trainnn <- bank_final[insta, ]
testtt <- bank_final[-insta, ]

Glm

model5 <- glm(y~., family = "binomial", data = trainnn)
pred5 <- predict(model5, newdata = testtt, type = "response")
pred5 <- as.factor(ifelse(pred5 >0.5, "yes", "no"))
confusionMatrix(pred5, reference = testtt$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no   992  115
##        yes  163 1050
##                                           
##                Accuracy : 0.8802          
##                  95% CI : (0.8663, 0.8931)
##     No Information Rate : 0.5022          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7603          
##                                           
##  Mcnemar's Test P-Value : 0.004819        
##                                           
##             Sensitivity : 0.9013          
##             Specificity : 0.8589          
##          Pos Pred Value : 0.8656          
##          Neg Pred Value : 0.8961          
##              Prevalence : 0.5022          
##          Detection Rate : 0.4526          
##    Detection Prevalence : 0.5228          
##       Balanced Accuracy : 0.8801          
##                                           
##        'Positive' Class : yes             
##

DecisionTree

model6 <- ctree(y~.,data = trainnn)
pred6 <- predict(model6, newdata = testtt)
confusionMatrix(pred6, reference = testtt$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no   974   92
##        yes  181 1073
##                                           
##                Accuracy : 0.8823          
##                  95% CI : (0.8685, 0.8952)
##     No Information Rate : 0.5022          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7646          
##                                           
##  Mcnemar's Test P-Value : 1.004e-07       
##                                           
##             Sensitivity : 0.9210          
##             Specificity : 0.8433          
##          Pos Pred Value : 0.8557          
##          Neg Pred Value : 0.9137          
##              Prevalence : 0.5022          
##          Detection Rate : 0.4625          
##    Detection Prevalence : 0.5405          
##       Balanced Accuracy : 0.8822          
##                                           
##        'Positive' Class : yes             
##

RandomForest

modelrff <- randomForest(y~., data = trainnn)
predrff <- predict(modelrff, newdata = testtt)
confusionMatrix(predrff, reference = testtt$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no   995   76
##        yes  160 1089
##                                           
##                Accuracy : 0.8983          
##                  95% CI : (0.8853, 0.9103)
##     No Information Rate : 0.5022          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7965          
##                                           
##  Mcnemar's Test P-Value : 6.559e-08       
##                                           
##             Sensitivity : 0.9348          
##             Specificity : 0.8615          
##          Pos Pred Value : 0.8719          
##          Neg Pred Value : 0.9290          
##              Prevalence : 0.5022          
##          Detection Rate : 0.4694          
##    Detection Prevalence : 0.5384          
##       Balanced Accuracy : 0.8981          
##                                           
##        'Positive' Class : yes             
##

Try Cluster

library(factoextra)
bank2 <- bank_upsampling
bank2 <- bank2 %>% 
  select(c(emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed))

bank2scale <- scale(bank2)

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(123)

bankcluster <- kmeans(x = as.data.frame(bank2scale), centers = 4)
bank2$cluster <- bankcluster$cluster

bank3 <- cbind(bank_upsampling, cluster = bank2$cluster)
bank3 <- bank3 %>% 
  select(-c(emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed)) %>% 
  mutate(cluster = as.factor(cluster))

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(867)

indo <- sample(nrow(bank3), nrow(bank3)*0.75)
traina <- bank3[indo, ]
testa <- bank3[-indo, ]

Glm

modela <- glm(y~., data = traina, family = "binomial")
preda <- predict(modela, newdata = testa)

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

preda <- as.factor(ifelse(preda >0.5, "yes", "no"))
confusionMatrix(preda, reference = testa$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  1028  241
##        yes   96  955
##                                           
##                Accuracy : 0.8547          
##                  95% CI : (0.8397, 0.8688)
##     No Information Rate : 0.5155          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7103          
##                                           
##  Mcnemar's Test P-Value : 4.358e-15       
##                                           
##             Sensitivity : 0.7985          
##             Specificity : 0.9146          
##          Pos Pred Value : 0.9087          
##          Neg Pred Value : 0.8101          
##              Prevalence : 0.5155          
##          Detection Rate : 0.4116          
##    Detection Prevalence : 0.4530          
##       Balanced Accuracy : 0.8565          
##                                           
##        'Positive' Class : yes             
##

Decision Tree

dtreemodel <- ctree(y~., data = traina)
predaa <- predict(dtreemodel, newdata = testa)
confusionMatrix(predaa, reference = testa$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no   964  114
##        yes  160 1082
##                                           
##                Accuracy : 0.8819          
##                  95% CI : (0.8681, 0.8948)
##     No Information Rate : 0.5155          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7633          
##                                           
##  Mcnemar's Test P-Value : 0.006557        
##                                           
##             Sensitivity : 0.9047          
##             Specificity : 0.8577          
##          Pos Pred Value : 0.8712          
##          Neg Pred Value : 0.8942          
##              Prevalence : 0.5155          
##          Detection Rate : 0.4664          
##    Detection Prevalence : 0.5353          
##       Balanced Accuracy : 0.8812          
##                                           
##        'Positive' Class : yes             
##

Random Forest

rfmodel <- randomForest(y~., data = traina)
rfpred <- predict(rfmodel, newdata = testa)
confusionMatrix(rfpred, reference = testa$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no   955   75
##        yes  169 1121
##                                          
##                Accuracy : 0.8948         
##                  95% CI : (0.8816, 0.907)
##     No Information Rate : 0.5155         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7889         
##                                          
##  Mcnemar's Test P-Value : 2.621e-09      
##                                          
##             Sensitivity : 0.9373         
##             Specificity : 0.8496         
##          Pos Pred Value : 0.8690         
##          Neg Pred Value : 0.9272         
##              Prevalence : 0.5155         
##          Detection Rate : 0.4832         
##    Detection Prevalence : 0.5560         
##       Balanced Accuracy : 0.8935         
##                                          
##        'Positive' Class : yes            
##

set.seed(123)

#ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 5)

#bank_foresta <- train(y~., data = traina, method = "rf", trControl = ctrl)

#saveRDS(bank_foresta, "bank_forestt.RDS") # simpan model

modelrds2 <- readRDS("bank_forestt.RDS")
predrds <- predict(modelrds2, newdata = testa, type = "raw")
confusionMatrix(predrds, reference = testa$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  853 519
##        yes 271 677
##                                           
##                Accuracy : 0.6595          
##                  95% CI : (0.6398, 0.6788)
##     No Information Rate : 0.5155          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3228          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5661          
##             Specificity : 0.7589          
##          Pos Pred Value : 0.7141          
##          Neg Pred Value : 0.6217          
##              Prevalence : 0.5155          
##          Detection Rate : 0.2918          
##    Detection Prevalence : 0.4086          
##       Balanced Accuracy : 0.6625          
##                                           
##        'Positive' Class : yes             
##

varImp(modelrds2)

## rf variable importance
## 
##   only 20 most important variables shown (out of 51)
## 
##                            Overall
## duration                   100.000
## cluster4                    35.634
## age                         14.341
## monthmay                    13.336
## pdays                        7.905
## campaign                     6.180
## poutcomesuccess              3.826
## monthoct                     2.716
## contacttelephone             2.703
## housingyes                   2.357
## defaultunknown               2.240
## previous                     2.116
## educationuniversity.degree   2.023
## jobblue-collar               1.932
## day_of_weekthu               1.887
## day_of_weekmon               1.885
## day_of_weektue               1.744
## maritalmarried               1.674
## educationhigh.school         1.661
## day_of_weekwed               1.615

Table

tabel <- data.frame(Model = c("Glm_Full", "GLM_BackwardPredictor", "DTree_BackwardPredictor1", "DTree_BackwardPredictor2" , "RF_BackwardPredictor", "RF_Full", "Glm10Predictor", "DTree10Predictor", "RF10Predictor", "GLMCluster", "DtreeCluster", "RFCluster", "RFClusterFull" ),
                    Accuracy = c(0.8694, 0.8621, 0.8724, 0.8728, 0.8858, 0.9203, 0.8672, 0.8776, 0.8888, 0.8461, 0.8672, 0.8823, 0.8759),
                    sensitivity = c(0.8807, 0.8800, 0.9276, 0.9294, 0.9294, 0.9854, 0.8824, 0.9279, 0.9253, 0.7935, 0.8871, 0.9089, 0.8955),
                    specificity = c(0.8580, 0.8450, 0.8197, 0.8189, 0.8441, 0.8545, 0.8519, 0.8268, 0.8519, 0.9021, 0.8256, 0.8336, 0.8345))
tabel %>% 
  kbl() %>% 
  kable_paper("hover", full_width = F)

Model	Accuracy	sensitivity	specificity
Glm_Full	0.8694	0.8807	0.8580
GLM_BackwardPredictor	0.8621	0.8800	0.8450
DTree_BackwardPredictor1	0.8724	0.9276	0.8197
DTree_BackwardPredictor2	0.8728	0.9294	0.8189
RF_BackwardPredictor	0.8858	0.9294	0.8441
RF_Full	0.9203	0.9854	0.8545
Glm10Predictor	0.8672	0.8824	0.8519
DTree10Predictor	0.8776	0.9279	0.8268
RF10Predictor	0.8888	0.9253	0.8519
GLMCluster	0.8461	0.7935	0.9021
DtreeCluster	0.8672	0.8871	0.8256
RFCluster	0.8823	0.9089	0.8336
RFClusterFull	0.8759	0.8955	0.8345

Conclusion

Kesimpulan pertama yang bisa saya ambil dari model machine learning yang saya lakukan yaitu bahwa Variable Durasi, Bulan, Usia, dan variable yang berkaitan dengan kondisi ekonomi adalah variable yang paling berpengaruh dalam memprediksi berhasil atau tidaknya marketing tersebut.

kesimpulan kedua yaitu dengan mengcluster variable yang berkaitan dengan kondisi ekonomi tidak membuat model yang dibuat menjadi lebih baik dalam memprediksi. dan dengan menghilangkan beberapa prediktor juga tidak membuat model tersebut menjadi lebih baik, karena model akan lebih baik jika menggunakan seluruh prediktor.
lalu dari semua model yang saya buat hampir semuanya lebih baik dalam memprediksi positive yaitu memprediksi yes, namun dalam memprediksi negative atau No tidak sebaik dalam memprediksi positif, kecuali pada model glm dari data cluster. sehingga masih perlu ada peningkatan dalam model ini, mungkin suatu saat apabila ilmu saya sudah bertambah akan saya buat lagi model baru yang lebih akurat