Bank Classification with Several Algorithms
Intro
This report compares several classification algorithms (logistic regression, decision tree, and random forest) for predicting whether a prospective customer subscribes (y = "yes"/"no"), using a bank marketing dataset.
Import Libraries
library(tidyverse)
library(lubridate)
library(caret)
library(e1071)
library(partykit)
library(randomForest)
library(car)
library(kableExtra)
Read Data
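The chunk that reads the data is not echoed. A minimal sketch of how it was likely done, assuming the standard semicolon-separated UCI file bank-additional-full.csv and the object name bank (both are assumptions):
# Read the bank marketing data (file name and object name are assumptions)
bank <- read.csv("bank-additional-full.csv", sep = ";")
head(bank)
str(bank)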
## age job marital education default housing loan contact month
## 1 56 housemaid married basic.4y no no no telephone may
## 2 57 services married high.school unknown no no telephone may
## 3 37 services married high.school no yes no telephone may
## 4 40 admin. married basic.6y no no no telephone may
## 5 56 services married high.school no no yes telephone may
## 6 45 services married basic.9y unknown no no telephone may
## day_of_week duration campaign pdays previous poutcome emp.var.rate
## 1 mon 261 1 999 0 nonexistent 1.1
## 2 mon 149 1 999 0 nonexistent 1.1
## 3 mon 226 1 999 0 nonexistent 1.1
## 4 mon 151 1 999 0 nonexistent 1.1
## 5 mon 307 1 999 0 nonexistent 1.1
## 6 mon 198 1 999 0 nonexistent 1.1
## cons.price.idx cons.conf.idx euribor3m nr.employed y
## 1 93.994 -36.4 4.857 5191 no
## 2 93.994 -36.4 4.857 5191 no
## 3 93.994 -36.4 4.857 5191 no
## 4 93.994 -36.4 4.857 5191 no
## 5 93.994 -36.4 4.857 5191 no
## 6 93.994 -36.4 4.857 5191 no
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
Age: the prospective customer's age
Job: the prospective customer's type of job
Marital: marital status of the prospective customer
Education: education level of the prospective customer
Default: whether the prospective customer has credit in default
Housing: whether the prospective customer has a housing loan
Loan: whether the prospective customer has a personal loan
Contact: communication type used to contact the prospective customer
Month: month of the last contact
Day_of_week: day of the week of the last contact
Duration: duration of the last contact with the prospective customer
Campaign: number of contacts made to the prospective customer during this campaign
Pdays: number of days since the prospective customer was last contacted in a previous campaign (999 means the customer was not previously contacted)
Previous: number of contacts made to the prospective customer before this campaign
Poutcome: outcome of the previous marketing campaign
Emp.var.rate: employment variation rate - quarterly indicator
Cons.price.idx: consumer price index - monthly indicator
Cons.conf.idx: consumer confidence index - monthly indicator
Euribor3m: euribor 3 month rate - daily indicator
Nr.employed: number of employees - quarterly indicator
y: whether the prospective customer subscribed ("yes", "no")
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y
## 0
Data Wrangling and EDA
##
## no yes
## 0.8873349 0.1126651
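The class proportions above show a strong imbalance (about 89% "no"). The balancing chunk itself is hidden; the object used for modelling is called bank_upsampling, but the later row counts (6,960 training and 2,320 test rows, i.e. roughly 9,280 balanced observations out of 41,188) suggest the majority class was actually downsampled to match the ~4,640 "yes" cases. A minimal sketch of such a step with caret, under those assumptions:
# Sketch of the hidden wrangling step (assumed): convert character columns to
# factors, then balance the classes with caret. The row counts later in the
# report point to downsampling the majority class, despite the object name.
bank <- bank %>%
  mutate(across(where(is.character), as.factor))

bank_upsampling <- downSample(x = bank %>% select(-y),
                              y = bank$y,
                              yname = "y")

prop.table(table(bank_upsampling$y)) # should now be roughly 50/50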
Modelling
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
index <- sample(nrow(bank_upsampling), nrow(bank_upsampling)*0.75)
train <- bank_upsampling[index, ]
test <- bank_upsampling[-index, ]
unique(train$default)
## [1] unknown no
## Levels: no unknown yes
## [1] no unknown
## Levels: no unknown yes
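The checks above confirm which levels of default actually appear in the training (and presumably test) split. The model itself is fit in a hidden chunk; based on the Call printed in the summary below, it is a full logistic regression on all predictors (the object name model1 is an assumption, chosen to match pred1 used later):
# Full logistic regression on all predictors (matches the Call shown below)
model1 <- glm(y ~ ., family = "binomial", data = train)
summary(model1)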
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.0068 -0.3582 -0.0844 0.4629 2.9879
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.280e+02 6.836e+01 -3.335 0.000853 ***
## age -4.490e-03 4.626e-03 -0.971 0.331773
## jobblue-collar -1.919e-01 1.516e-01 -1.265 0.205695
## jobentrepreneur -1.080e-02 2.329e-01 -0.046 0.963031
## jobhousemaid 1.274e-01 2.685e-01 0.474 0.635150
## jobmanagement -2.114e-01 1.609e-01 -1.314 0.188801
## jobretired 8.060e-01 2.169e-01 3.716 0.000202 ***
## jobself-employed -2.806e-01 2.258e-01 -1.243 0.213907
## jobservices -2.208e-01 1.636e-01 -1.350 0.177139
## jobstudent 3.432e-01 2.342e-01 1.466 0.142769
## jobtechnician -9.875e-02 1.360e-01 -0.726 0.467803
## jobunemployed -5.693e-02 2.455e-01 -0.232 0.816656
## jobunknown 1.063e-01 4.528e-01 0.235 0.814435
## maritalmarried 9.525e-02 1.279e-01 0.745 0.456442
## maritalsingle 2.657e-01 1.466e-01 1.813 0.069905 .
## maritalunknown 1.381e+00 9.537e-01 1.448 0.147557
## educationbasic.6y -9.957e-02 2.283e-01 -0.436 0.662703
## educationbasic.9y -9.869e-03 1.790e-01 -0.055 0.956028
## educationhigh.school 7.153e-02 1.769e-01 0.404 0.685976
## educationilliterate 1.265e+01 2.264e+02 0.056 0.955452
## educationprofessional.course 1.056e-01 1.930e-01 0.547 0.584236
## educationuniversity.degree 2.781e-01 1.778e-01 1.564 0.117870
## educationunknown 9.598e-02 2.268e-01 0.423 0.672207
## defaultunknown -2.254e-01 1.239e-01 -1.819 0.068920 .
## housingunknown -1.307e-01 2.703e-01 -0.483 0.628789
## housingyes -1.009e-02 7.943e-02 -0.127 0.898945
## loanunknown NA NA NA NA
## loanyes 8.192e-03 1.104e-01 0.074 0.940836
## contacttelephone -2.620e-01 1.444e-01 -1.815 0.069597 .
## monthaug 1.072e+00 2.505e-01 4.281 1.86e-05 ***
## monthdec -2.376e-01 4.360e-01 -0.545 0.585795
## monthjul -3.749e-02 1.850e-01 -0.203 0.839389
## monthjun -9.485e-01 2.312e-01 -4.102 4.09e-05 ***
## monthmar 2.268e+00 3.040e-01 7.462 8.54e-14 ***
## monthmay -9.892e-01 1.553e-01 -6.370 1.89e-10 ***
## monthnov -7.511e-01 2.295e-01 -3.273 0.001065 **
## monthoct 4.904e-01 3.059e-01 1.603 0.108915
## monthsep 1.087e-01 3.378e-01 0.322 0.747485
## day_of_weekmon 2.085e-02 1.242e-01 0.168 0.866657
## day_of_weekthu -4.259e-02 1.231e-01 -0.346 0.729251
## day_of_weektue 1.387e-01 1.265e-01 1.097 0.272839
## day_of_weekwed 2.115e-01 1.248e-01 1.695 0.090046 .
## duration 7.660e-03 2.110e-04 36.301 < 2e-16 ***
## campaign -4.624e-02 2.176e-02 -2.125 0.033575 *
## pdays -5.247e-04 5.042e-04 -1.041 0.297986
## previous 9.629e-02 1.513e-01 0.636 0.524638
## poutcomenonexistent 6.452e-01 2.081e-01 3.101 0.001932 **
## poutcomesuccess 1.423e+00 4.976e-01 2.859 0.004246 **
## emp.var.rate -2.201e+00 2.563e-01 -8.587 < 2e-16 ***
## cons.price.idx 2.224e+00 4.528e-01 4.912 9.03e-07 ***
## cons.conf.idx -4.415e-03 1.618e-02 -0.273 0.785021
## euribor3m 6.483e-01 2.515e-01 2.578 0.009948 **
## nr.employed 2.783e-03 5.642e-03 0.493 0.621857
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9648.6 on 6959 degrees of freedom
## Residual deviance: 4357.6 on 6908 degrees of freedom
## AIC: 4461.6
##
## Number of Fisher Scoring iterations: 12
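The trace below comes from backward elimination by AIC on the full model; a sketch of the hidden call (the object name model_backward is an assumption):
# Backward stepwise selection by AIC, starting from the full logistic model
model_backward <- step(model1, direction = "backward")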
## Start: AIC=4461.59
## y ~ age + job + marital + education + default + housing + loan +
## contact + month + day_of_week + duration + campaign + pdays +
## previous + poutcome + emp.var.rate + cons.price.idx + cons.conf.idx +
## euribor3m + nr.employed
##
## Df Deviance AIC
## - education 7 4366.1 4456.1
## - day_of_week 4 4363.5 4459.5
## - loan 1 4357.6 4459.6
## - housing 1 4357.6 4459.6
## - cons.conf.idx 1 4357.7 4459.7
## - nr.employed 1 4357.8 4459.8
## - previous 1 4358.0 4460.0
## - age 1 4358.5 4460.5
## - pdays 1 4358.7 4460.7
## - marital 3 4363.6 4461.6
## <none> 4357.6 4461.6
## - contact 1 4360.9 4462.9
## - default 1 4360.9 4462.9
## - campaign 1 4362.3 4464.3
## - euribor3m 1 4364.3 4466.3
## - job 11 4389.2 4471.2
## - poutcome 2 4372.3 4472.3
## - cons.price.idx 1 4381.7 4483.7
## - emp.var.rate 1 4431.0 4533.0
## - month 9 4696.0 4782.0
## - duration 1 7486.2 7588.2
##
## Step: AIC=4456.08
## y ~ age + job + marital + default + housing + loan + contact +
## month + day_of_week + duration + campaign + pdays + previous +
## poutcome + emp.var.rate + cons.price.idx + cons.conf.idx +
## euribor3m + nr.employed
##
## Df Deviance AIC
## - day_of_week 4 4371.7 4453.7
## - loan 1 4366.1 4454.1
## - housing 1 4366.1 4454.1
## - cons.conf.idx 1 4366.2 4454.2
## - nr.employed 1 4366.3 4454.3
## - previous 1 4366.6 4454.6
## - pdays 1 4367.1 4455.1
## - age 1 4367.5 4455.5
## <none> 4366.1 4456.1
## - marital 3 4372.7 4456.7
## - contact 1 4369.3 4457.3
## - default 1 4370.0 4458.0
## - campaign 1 4370.4 4458.4
## - euribor3m 1 4372.8 4460.8
## - poutcome 2 4381.8 4467.8
## - job 11 4400.6 4468.6
## - cons.price.idx 1 4389.8 4477.8
## - emp.var.rate 1 4438.6 4526.6
## - month 9 4713.4 4785.4
## - duration 1 7495.0 7583.0
##
## Step: AIC=4453.68
## y ~ age + job + marital + default + housing + loan + contact +
## month + duration + campaign + pdays + previous + poutcome +
## emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m +
## nr.employed
##
## Df Deviance AIC
## - loan 1 4371.7 4451.7
## - housing 1 4371.7 4451.7
## - nr.employed 1 4371.8 4451.8
## - cons.conf.idx 1 4371.8 4451.8
## - previous 1 4372.3 4452.3
## - pdays 1 4372.6 4452.6
## - age 1 4373.2 4453.2
## <none> 4371.7 4453.7
## - marital 3 4378.4 4454.4
## - contact 1 4374.6 4454.6
## - default 1 4375.6 4455.6
## - campaign 1 4376.4 4456.4
## - euribor3m 1 4379.2 4459.2
## - poutcome 2 4387.8 4465.8
## - job 11 4407.0 4467.0
## - cons.price.idx 1 4394.1 4474.1
## - emp.var.rate 1 4443.1 4523.1
## - month 9 4718.0 4782.0
## - duration 1 7505.6 7585.6
##
## Step: AIC=4451.68
## y ~ age + job + marital + default + housing + contact + month +
## duration + campaign + pdays + previous + poutcome + emp.var.rate +
## cons.price.idx + cons.conf.idx + euribor3m + nr.employed
##
## Df Deviance AIC
## - housing 2 4371.9 4447.9
## - nr.employed 1 4371.8 4449.8
## - cons.conf.idx 1 4371.8 4449.8
## - previous 1 4372.3 4450.3
## - pdays 1 4372.7 4450.7
## - age 1 4373.2 4451.2
## <none> 4371.7 4451.7
## - marital 3 4378.4 4452.4
## - contact 1 4374.6 4452.6
## - default 1 4375.7 4453.7
## - campaign 1 4376.4 4454.4
## - euribor3m 1 4379.2 4457.2
## - poutcome 2 4387.8 4463.8
## - job 11 4407.0 4465.0
## - cons.price.idx 1 4394.1 4472.1
## - emp.var.rate 1 4443.2 4521.2
## - month 9 4718.3 4780.3
## - duration 1 7506.0 7584.0
##
## Step: AIC=4447.92
## y ~ age + job + marital + default + contact + month + duration +
## campaign + pdays + previous + poutcome + emp.var.rate + cons.price.idx +
## cons.conf.idx + euribor3m + nr.employed
##
## Df Deviance AIC
## - nr.employed 1 4372.0 4446.0
## - cons.conf.idx 1 4372.1 4446.1
## - previous 1 4372.5 4446.5
## - pdays 1 4372.9 4446.9
## - age 1 4373.4 4447.4
## <none> 4371.9 4447.9
## - marital 3 4378.6 4448.6
## - contact 1 4374.8 4448.8
## - default 1 4375.9 4449.9
## - campaign 1 4376.6 4450.6
## - euribor3m 1 4379.4 4453.4
## - poutcome 2 4388.1 4460.1
## - job 11 4407.2 4461.2
## - cons.price.idx 1 4394.3 4468.3
## - emp.var.rate 1 4443.4 4517.4
## - month 9 4719.0 4777.0
## - duration 1 7510.6 7584.6
##
## Step: AIC=4446.01
## y ~ age + job + marital + default + contact + month + duration +
## campaign + pdays + previous + poutcome + emp.var.rate + cons.price.idx +
## cons.conf.idx + euribor3m
##
## Df Deviance AIC
## - previous 1 4372.6 4444.6
## - cons.conf.idx 1 4372.7 4444.7
## - pdays 1 4373.0 4445.0
## - age 1 4373.5 4445.5
## <none> 4372.0 4446.0
## - marital 3 4378.7 4446.7
## - contact 1 4374.8 4446.8
## - default 1 4376.0 4448.0
## - campaign 1 4376.7 4448.7
## - poutcome 2 4388.2 4458.2
## - job 11 4407.3 4459.3
## - euribor3m 1 4393.4 4465.4
## - cons.price.idx 1 4461.8 4533.8
## - emp.var.rate 1 4469.3 4541.3
## - month 9 4732.0 4788.0
## - duration 1 7512.4 7584.4
##
## Step: AIC=4444.61
## y ~ age + job + marital + default + contact + month + duration +
## campaign + pdays + poutcome + emp.var.rate + cons.price.idx +
## cons.conf.idx + euribor3m
##
## Df Deviance AIC
## - cons.conf.idx 1 4373.3 4443.3
## - age 1 4374.1 4444.1
## <none> 4372.6 4444.6
## - pdays 1 4374.7 4444.7
## - marital 3 4379.3 4445.3
## - contact 1 4375.6 4445.6
## - default 1 4376.6 4446.6
## - campaign 1 4377.3 4447.3
## - job 11 4408.2 4458.2
## - euribor3m 1 4393.9 4463.9
## - poutcome 2 4399.9 4467.9
## - cons.price.idx 1 4465.4 4535.4
## - emp.var.rate 1 4470.3 4540.3
## - month 9 4733.1 4787.1
## - duration 1 7512.6 7582.6
##
## Step: AIC=4443.28
## y ~ age + job + marital + default + contact + month + duration +
## campaign + pdays + poutcome + emp.var.rate + cons.price.idx +
## euribor3m
##
## Df Deviance AIC
## - age 1 4374.9 4442.9
## <none> 4373.3 4443.3
## - pdays 1 4375.3 4443.3
## - marital 3 4379.8 4443.8
## - contact 1 4377.1 4445.1
## - default 1 4377.4 4445.4
## - campaign 1 4378.0 4446.0
## - job 11 4408.6 4456.6
## - euribor3m 1 4395.2 4463.2
## - poutcome 2 4401.0 4467.0
## - cons.price.idx 1 4470.0 4538.0
## - emp.var.rate 1 4485.5 4553.5
## - month 9 4766.8 4818.8
## - duration 1 7513.4 7581.4
##
## Step: AIC=4442.9
## y ~ job + marital + default + contact + month + duration + campaign +
## pdays + poutcome + emp.var.rate + cons.price.idx + euribor3m
##
## Df Deviance AIC
## <none> 4374.9 4442.9
## - pdays 1 4377.0 4443.0
## - contact 1 4378.7 4444.7
## - campaign 1 4379.8 4445.8
## - default 1 4379.9 4445.9
## - marital 3 4385.1 4447.1
## - job 11 4410.2 4456.2
## - euribor3m 1 4396.2 4462.2
## - poutcome 2 4402.9 4466.9
## - cons.price.idx 1 4470.2 4536.2
## - emp.var.rate 1 4485.8 4551.8
## - month 9 4766.8 4816.8
## - duration 1 7513.6 7579.6
Evaluation
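The prediction chunk is hidden. Given the rank-deficient-fit warning below (caused by the NA coefficient on loanunknown in the full model), the probabilities were most likely produced from model1 rather than the backward model; a sketch using the names assumed above:
# Predict subscription probabilities on the test set
pred1 <- predict(model1, newdata = test, type = "response")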
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
pred1 <- as.factor(ifelse(pred1 >0.5, "yes", "no"))
confusionMatrix(pred1, reference = test$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1008 111
## yes 147 1054
##
## Accuracy : 0.8888
## 95% CI : (0.8753, 0.9013)
## No Information Rate : 0.5022
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7776
##
## Mcnemar's Test P-Value : 0.02933
##
## Sensitivity : 0.9047
## Specificity : 0.8727
## Pos Pred Value : 0.8776
## Neg Pred Value : 0.9008
## Prevalence : 0.5022
## Detection Rate : 0.4543
## Detection Prevalence : 0.5177
## Balanced Accuracy : 0.8887
##
## 'Positive' Class : yes
##
Create New Data and a New Model
Keep only the variables that show a strong association with y based on the summary of model 1.
data_new <- bank_upsampling
data_new <- data_new %>%
  select(c(y, age, default, contact, month, duration, poutcome, emp.var.rate, cons.price.idx, euribor3m))
set.seed(123)
input <- sample(nrow(data_new), nrow(data_new)*0.75)
trainn <- data_new[input, ]
testt <- data_new[-input, ]
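The reduced model is fit in a hidden chunk; based on the Call in the summary below and the name model2 used in the prediction code further down, it is:
# Logistic regression on the reduced predictor set
model2 <- glm(y ~ ., family = "binomial", data = trainn)
summary(model2)
##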
## Call:
## glm(formula = y ~ ., family = "binomial", data = trainn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.0326 -0.3621 -0.1261 0.4692 2.9102
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.027e+02 1.973e+01 -10.272 < 2e-16 ***
## age -1.157e-03 3.144e-03 -0.368 0.71283
## defaultunknown -3.272e-01 1.196e-01 -2.736 0.00621 **
## contacttelephone -2.898e-01 1.342e-01 -2.158 0.03089 *
## monthaug 1.048e+00 1.739e-01 6.028 1.66e-09 ***
## monthdec -3.826e-01 4.019e-01 -0.952 0.34102
## monthjul -8.161e-02 1.746e-01 -0.467 0.64019
## monthjun -9.398e-01 1.956e-01 -4.804 1.55e-06 ***
## monthmar 2.220e+00 2.589e-01 8.576 < 2e-16 ***
## monthmay -1.100e+00 1.382e-01 -7.964 1.67e-15 ***
## monthnov -7.830e-01 1.981e-01 -3.954 7.70e-05 ***
## monthoct 4.702e-01 2.728e-01 1.724 0.08476 .
## monthsep -4.681e-02 2.362e-01 -0.198 0.84290
## duration 7.558e-03 2.070e-04 36.510 < 2e-16 ***
## poutcomenonexistent 5.356e-01 1.181e-01 4.535 5.77e-06 ***
## poutcomesuccess 1.971e+00 1.974e-01 9.985 < 2e-16 ***
## emp.var.rate -2.138e+00 1.952e-01 -10.957 < 2e-16 ***
## cons.price.idx 2.104e+00 2.069e-01 10.171 < 2e-16 ***
## euribor3m 7.028e-01 1.448e-01 4.853 1.22e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9648.6 on 6959 degrees of freedom
## Residual deviance: 4427.9 on 6941 degrees of freedom
## AIC: 4465.9
##
## Number of Fisher Scoring iterations: 6
pred2 <- predict(model2, newdata = testt, type = "response")
pred2 <- as.factor(ifelse(pred2 >0.5, "yes", "no"))
confusionMatrix(pred2, reference = testt$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1000 111
## yes 155 1054
##
## Accuracy : 0.8853
## 95% CI : (0.8717, 0.898)
## No Information Rate : 0.5022
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7706
##
## Mcnemar's Test P-Value : 0.008377
##
## Sensitivity : 0.9047
## Specificity : 0.8658
## Pos Pred Value : 0.8718
## Neg Pred Value : 0.9001
## Prevalence : 0.5022
## Detection Rate : 0.4543
## Detection Prevalence : 0.5211
## Balanced Accuracy : 0.8853
##
## 'Positive' Class : yes
##
Try Other Algorithms
Decision Tree
modeldtree <- ctree(y~., data = trainn)
preddtree <- predict(modeldtree, newdata = testt)
confusionMatrix(preddtree, reference = testt$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 998 107
## yes 157 1058
##
## Accuracy : 0.8862
## 95% CI : (0.8726, 0.8989)
## No Information Rate : 0.5022
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7724
##
## Mcnemar's Test P-Value : 0.002564
##
## Sensitivity : 0.9082
## Specificity : 0.8641
## Pos Pred Value : 0.8708
## Neg Pred Value : 0.9032
## Prevalence : 0.5022
## Detection Rate : 0.4560
## Detection Prevalence : 0.5237
## Balanced Accuracy : 0.8861
##
## 'Positive' Class : yes
##
modeldtreee <- ctree(y~., data = trainn,
control = ctree_control(mincriterion = 0.95,
minsplit = 5,
minbucket = 5))
preddtreee <- predict(modeldtreee, newdata = testt)
confusionMatrix(preddtreee, reference = testt$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 998 103
## yes 157 1062
##
## Accuracy : 0.8879
## 95% CI : (0.8744, 0.9005)
## No Information Rate : 0.5022
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7758
##
## Mcnemar's Test P-Value : 0.001013
##
## Sensitivity : 0.9116
## Specificity : 0.8641
## Pos Pred Value : 0.8712
## Neg Pred Value : 0.9064
## Prevalence : 0.5022
## Detection Rate : 0.4578
## Detection Prevalence : 0.5254
## Balanced Accuracy : 0.8878
##
## 'Positive' Class : yes
##
Random Forest
modelrf <- randomForest(y~., data = trainn, importance = TRUE, ntree = 500)
predrf <- predict(modelrf, newdata = testt)
confusionMatrix(predrf, reference = testt$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 993 70
## yes 162 1095
##
## Accuracy : 0.9
## 95% CI : (0.8871, 0.9119)
## No Information Rate : 0.5022
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7999
##
## Mcnemar's Test P-Value : 2.309e-09
##
## Sensitivity : 0.9399
## Specificity : 0.8597
## Pos Pred Value : 0.8711
## Neg Pred Value : 0.9341
## Prevalence : 0.5022
## Detection Rate : 0.4720
## Detection Prevalence : 0.5418
## Balanced Accuracy : 0.8998
##
## 'Positive' Class : yes
##
#set.seed(833)
#ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 5)
#bank_forest <- train(y~., data = train, method = "rf", trControl = ctrl)
#saveRDS(bank_forest, "bank_forest.RDS") # save the model
model4 <- readRDS("bank_forest.RDS")
pred4 <- predict(model4, newdata = test)
confusionMatrix(pred4, reference = test$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1000 16
## yes 155 1149
##
## Accuracy : 0.9263
## 95% CI : (0.9149, 0.9366)
## No Information Rate : 0.5022
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8525
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9863
## Specificity : 0.8658
## Pos Pred Value : 0.8811
## Neg Pred Value : 0.9843
## Prevalence : 0.5022
## Detection Rate : 0.4953
## Detection Prevalence : 0.5621
## Balanced Accuracy : 0.9260
##
## 'Positive' Class : yes
##
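The output below (the underlying randomForest object and the variable-importance table) was presumably produced by inspecting the loaded caret model; a sketch:
# Inspect the cross-validated random forest loaded from bank_forest.RDS
model4$finalModel # the underlying randomForest fit
varImp(model4)    # caret variable importance, scaled so the top predictor = 100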
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 11.09%
## Confusion matrix:
## no yes class.error
## no 2921 547 0.15772780
## yes 225 3267 0.06443299
## rf variable importance
##
## only 20 most important variables shown (out of 53)
##
## Overall
## duration 100.000
## nr.employed 28.400
## euribor3m 25.345
## age 11.831
## emp.var.rate 8.480
## cons.conf.idx 8.262
## campaign 4.975
## cons.price.idx 4.204
## pdays 2.801
## monthoct 2.469
## monthmay 2.237
## housingyes 1.880
## educationuniversity.degree 1.673
## day_of_weekthu 1.628
## day_of_weekwed 1.596
## day_of_weekmon 1.585
## educationhigh.school 1.568
## defaultunknown 1.433
## jobblue-collar 1.423
## previous 1.422
New Data Using the Top 10 Predictors from the Random Forest
bank_final <- bank_upsampling
bank_final <- bank_final %>%
  select(duration, nr.employed, euribor3m, age, emp.var.rate, cons.conf.idx, cons.price.idx, campaign, pdays, month, y)
set.seed(123)
insta <- sample(nrow(bank_final), nrow(bank_final)*0.75)
trainnn <- bank_final[insta, ]
testtt <- bank_final[-insta, ]
GLM
model5 <- glm(y~., family = "binomial", data = trainnn)
pred5 <- predict(model5, newdata = testtt, type = "response")
pred5 <- as.factor(ifelse(pred5 >0.5, "yes", "no"))
confusionMatrix(pred5, reference = testtt$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 992 115
## yes 163 1050
##
## Accuracy : 0.8802
## 95% CI : (0.8663, 0.8931)
## No Information Rate : 0.5022
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7603
##
## Mcnemar's Test P-Value : 0.004819
##
## Sensitivity : 0.9013
## Specificity : 0.8589
## Pos Pred Value : 0.8656
## Neg Pred Value : 0.8961
## Prevalence : 0.5022
## Detection Rate : 0.4526
## Detection Prevalence : 0.5228
## Balanced Accuracy : 0.8801
##
## 'Positive' Class : yes
##
Decision Tree
model6 <- ctree(y~.,data = trainnn)
pred6 <- predict(model6, newdata = testtt)
confusionMatrix(pred6, reference = testtt$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 974 92
## yes 181 1073
##
## Accuracy : 0.8823
## 95% CI : (0.8685, 0.8952)
## No Information Rate : 0.5022
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7646
##
## Mcnemar's Test P-Value : 1.004e-07
##
## Sensitivity : 0.9210
## Specificity : 0.8433
## Pos Pred Value : 0.8557
## Neg Pred Value : 0.9137
## Prevalence : 0.5022
## Detection Rate : 0.4625
## Detection Prevalence : 0.5405
## Balanced Accuracy : 0.8822
##
## 'Positive' Class : yes
##
Random Forest
modelrff <- randomForest(y~., data = trainnn)
predrff <- predict(modelrff, newdata = testtt)
confusionMatrix(predrff, reference = testtt$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 995 76
## yes 160 1089
##
## Accuracy : 0.8983
## 95% CI : (0.8853, 0.9103)
## No Information Rate : 0.5022
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7965
##
## Mcnemar's Test P-Value : 6.559e-08
##
## Sensitivity : 0.9348
## Specificity : 0.8615
## Pos Pred Value : 0.8719
## Neg Pred Value : 0.9290
## Prevalence : 0.5022
## Detection Rate : 0.4694
## Detection Prevalence : 0.5384
## Balanced Accuracy : 0.8981
##
## 'Positive' Class : yes
##
Try Clustering
library(factoextra)
bank2 <- bank_upsampling
bank2 <- bank2 %>%
select(c(emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed))
bank2scale <- scale(bank2)
set.seed(123)
bankcluster <- kmeans(x = as.data.frame(bank2scale), centers = 4)
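The number of clusters is fixed at 4 here. Since factoextra is loaded, one optional way to sanity-check that choice (not part of the original output) is an elbow plot:
# Optional check (assumption, not in the original report): plot total
# within-cluster sum of squares for k = 1..10 to judge whether 4 is reasonable
fviz_nbclust(as.data.frame(bank2scale), kmeans, method = "wss")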
bank2$cluster <- bankcluster$cluster
bank3 <- cbind(bank_upsampling, cluster = bank2$cluster)
bank3 <- bank3 %>%
select(-c(emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed)) %>%
  mutate(cluster = as.factor(cluster))
set.seed(867)
indo <- sample(nrow(bank3), nrow(bank3)*0.75)
traina <- bank3[indo, ]
testa <- bank3[-indo, ]
GLM
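The GLM chunk for the clustered data is hidden; a sketch, assuming the model object name modela (preda is the name referenced in the next chunk):
# Logistic regression on the data with the economic indicators replaced by a cluster label
modela <- glm(y ~ ., family = "binomial", data = traina)
preda <- predict(modela, newdata = testa, type = "response")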
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
preda <- as.factor(ifelse(preda >0.5, "yes", "no"))
confusionMatrix(preda, reference = testa$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1028 241
## yes 96 955
##
## Accuracy : 0.8547
## 95% CI : (0.8397, 0.8688)
## No Information Rate : 0.5155
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7103
##
## Mcnemar's Test P-Value : 4.358e-15
##
## Sensitivity : 0.7985
## Specificity : 0.9146
## Pos Pred Value : 0.9087
## Neg Pred Value : 0.8101
## Prevalence : 0.5155
## Detection Rate : 0.4116
## Detection Prevalence : 0.4530
## Balanced Accuracy : 0.8565
##
## 'Positive' Class : yes
##
Decision Tree
dtreemodel <- ctree(y~., data = traina)
predaa <- predict(dtreemodel, newdata = testa)
confusionMatrix(predaa, reference = testa$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 964 114
## yes 160 1082
##
## Accuracy : 0.8819
## 95% CI : (0.8681, 0.8948)
## No Information Rate : 0.5155
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7633
##
## Mcnemar's Test P-Value : 0.006557
##
## Sensitivity : 0.9047
## Specificity : 0.8577
## Pos Pred Value : 0.8712
## Neg Pred Value : 0.8942
## Prevalence : 0.5155
## Detection Rate : 0.4664
## Detection Prevalence : 0.5353
## Balanced Accuracy : 0.8812
##
## 'Positive' Class : yes
##
Random Forest
rfmodel <- randomForest(y~., data = traina)
rfpred <- predict(rfmodel, newdata = testa)
confusionMatrix(rfpred, reference = testa$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 955 75
## yes 169 1121
##
## Accuracy : 0.8948
## 95% CI : (0.8816, 0.907)
## No Information Rate : 0.5155
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7889
##
## Mcnemar's Test P-Value : 2.621e-09
##
## Sensitivity : 0.9373
## Specificity : 0.8496
## Pos Pred Value : 0.8690
## Neg Pred Value : 0.9272
## Prevalence : 0.5155
## Detection Rate : 0.4832
## Detection Prevalence : 0.5560
## Balanced Accuracy : 0.8935
##
## 'Positive' Class : yes
##
set.seed(123)
#ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 5)
#bank_foresta <- train(y~., data = traina, method = "rf", trControl = ctrl)
#saveRDS(bank_foresta, "bank_forestt.RDS") # save the model
modelrds2 <- readRDS("bank_forestt.RDS")
predrds <- predict(modelrds2, newdata = testa, type = "raw")
confusionMatrix(predrds, reference = testa$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 853 519
## yes 271 677
##
## Accuracy : 0.6595
## 95% CI : (0.6398, 0.6788)
## No Information Rate : 0.5155
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3228
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5661
## Specificity : 0.7589
## Pos Pred Value : 0.7141
## Neg Pred Value : 0.6217
## Prevalence : 0.5155
## Detection Rate : 0.2918
## Detection Prevalence : 0.4086
## Balanced Accuracy : 0.6625
##
## 'Positive' Class : yes
##
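As before, the variable-importance table below presumably comes from the caret model loaded from bank_forestt.RDS; a sketch:
# Variable importance of the cross-validated random forest on the clustered data
varImp(modelrds2)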
## rf variable importance
##
## only 20 most important variables shown (out of 51)
##
## Overall
## duration 100.000
## cluster4 35.634
## age 14.341
## monthmay 13.336
## pdays 7.905
## campaign 6.180
## poutcomesuccess 3.826
## monthoct 2.716
## contacttelephone 2.703
## housingyes 2.357
## defaultunknown 2.240
## previous 2.116
## educationuniversity.degree 2.023
## jobblue-collar 1.932
## day_of_weekthu 1.887
## day_of_weekmon 1.885
## day_of_weektue 1.744
## maritalmarried 1.674
## educationhigh.school 1.661
## day_of_weekwed 1.615
Model Comparison
tabel <- data.frame(Model = c("Glm_Full", "GLM_BackwardPredictor", "DTree_BackwardPredictor1", "DTree_BackwardPredictor2" , "RF_BackwardPredictor", "RF_Full", "Glm10Predictor", "DTree10Predictor", "RF10Predictor", "GLMCluster", "DtreeCluster", "RFCluster", "RFClusterFull" ),
Accuracy = c(0.8694, 0.8621, 0.8724, 0.8728, 0.8858, 0.9203, 0.8672, 0.8776, 0.8888, 0.8461, 0.8672, 0.8823, 0.8759),
sensitivity = c(0.8807, 0.8800, 0.9276, 0.9294, 0.9294, 0.9854, 0.8824, 0.9279, 0.9253, 0.7935, 0.8871, 0.9089, 0.8955),
specificity = c(0.8580, 0.8450, 0.8197, 0.8189, 0.8441, 0.8545, 0.8519, 0.8268, 0.8519, 0.9021, 0.8256, 0.8336, 0.8345))
tabel %>%
kbl() %>%
  kable_paper("hover", full_width = F)

| Model | Accuracy | sensitivity | specificity |
|---|---|---|---|
| Glm_Full | 0.8694 | 0.8807 | 0.8580 |
| GLM_BackwardPredictor | 0.8621 | 0.8800 | 0.8450 |
| DTree_BackwardPredictor1 | 0.8724 | 0.9276 | 0.8197 |
| DTree_BackwardPredictor2 | 0.8728 | 0.9294 | 0.8189 |
| RF_BackwardPredictor | 0.8858 | 0.9294 | 0.8441 |
| RF_Full | 0.9203 | 0.9854 | 0.8545 |
| Glm10Predictor | 0.8672 | 0.8824 | 0.8519 |
| DTree10Predictor | 0.8776 | 0.9279 | 0.8268 |
| RF10Predictor | 0.8888 | 0.9253 | 0.8519 |
| GLMCluster | 0.8461 | 0.7935 | 0.9021 |
| DtreeCluster | 0.8672 | 0.8871 | 0.8256 |
| RFCluster | 0.8823 | 0.9089 | 0.8336 |
| RFClusterFull | 0.8759 | 0.8955 | 0.8345 |
Conclusion
The first conclusion I can draw from these machine learning models is that duration, month, age, and the variables describing economic conditions are the most influential predictors of whether the marketing contact succeeds. The second conclusion is that clustering the economic-condition variables does not make the models predict any better, and removing some of the predictors does not improve them either; the models perform best when all predictors are used.
Across all of the models, almost every one predicts the positive class ("yes") better than the negative class ("no"); the only exception is the GLM built on the clustered data. There is therefore still room for improvement, and as my skills grow I hope to revisit this and build a more accurate model.