This analysis aims to improve sales numbers for the Bank of Portugal. The raw campaign data can be combined with machine learning classification to help the telemarketing team reach the right customers. The target is to classify customers by whether they have already taken a loan or not (the target column y). Hopefully, this helps the Marketing Division prioritize customers who have no loan yet.
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57,…
## $ job <fct> management, technician, entrepreneur, blue-collar, unknown,…
## $ marital <fct> married, single, married, married, single, married, single,…
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary,…
## $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no…
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 7…
## $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, …
## $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, n…
## $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unkno…
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, may,…
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,…
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unkno…
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,…
## age job marital education
## Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851
## 1st Qu.:33.00 management :9458 married :27214 secondary:23202
## Median :39.00 technician :7597 single :12790 tertiary :13301
## Mean :40.94 admin. :5171 unknown : 1857
## 3rd Qu.:48.00 services :4154
## Max. :95.00 retired :2264
## (Other) :6835
## default balance housing loan contact
## no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
## yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
## Median : 448 unknown :13020
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## day month duration campaign
## Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
## Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
## (Other): 6060
## pdays previous poutcome y
## Min. : -1.0 Min. : 0.0000 failure: 4901 no :39922
## 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 yes: 5289
## Median : -1.0 Median : 0.0000 success: 1511
## Mean : 40.2 Mean : 0.5803 unknown:36959
## 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :871.0 Max. :275.0000
##
RNGkind(sample.kind = "Rounding")
set.seed(100)
# train-test splitting
index <- sample(nrow(bank), nrow(bank) * 0.75)
dep_train <- bank[index,] # training = 75%
dep_test <- bank[-index,] # testing = 25%
Check the target class proportions:
##
## no yes
## 0.8828005 0.1171995
##
## no yes
## 0.8836592 0.1163408
The outputs above show that in both the training and test sets, about 88% of customers do not yet have a deposit at the bank, so the target classes are imbalanced.
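A class-proportion check of the kind that produces output like the above can be sketched as follows. This is a minimal sketch on a toy vector: `y_toy` and its 88/12 split are assumptions for illustration, not the real bank data.

```r
# Toy target vector standing in for the y column (assumption: not the real data)
y_toy <- factor(c(rep("no", 88), rep("yes", 12)), levels = c("no", "yes"))

# Absolute counts per class, then the share of each class
class_counts <- table(y_toy)
class_props  <- prop.table(class_counts)
print(class_props)

# Share of the majority class: the accuracy of always predicting "no"
majority_share <- max(class_props)
print(majority_share)
```

The majority share is a useful baseline: a classifier only adds value if its accuracy exceeds it.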
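# train
library(dplyr)  # %>% and select()
library(e1071)  # naiveBayes()
loan_naive <- naiveBayes(x = dep_train %>% select(-y), # predictors
                         y = dep_train$y) # target
For a numeric predictor such as balance, the fitted model stores the per-class mean ([,1]) and standard deviation ([,2]):
## balance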
## dep_train$y [,1] [,2]
## no 1305.208 2974.823
## yes 1789.932 3353.306
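To make the role of these two numbers concrete, here is a minimal sketch of how a Gaussian Naive Bayes model would score a single balance value. It uses only the means and standard deviations printed above and the training-set class proportions as priors. The customer balance `balance_new` is a hypothetical value, and the real model combines all predictors, not just this one.

```r
# Per-class mean and standard deviation of balance, copied from the table above
mu    <- c(no = 1305.208, yes = 1789.932)
sigma <- c(no = 2974.823, yes = 3353.306)

# Class priors, taken from the training-set proportions reported earlier
prior <- c(no = 0.8828005, yes = 0.1171995)

# Hypothetical customer balance (illustrative value, not from the data)
balance_new <- 5000

# Gaussian likelihood of this balance under each class
likelihood <- dnorm(balance_new, mean = mu, sd = sigma)

# Bayes' rule: the posterior is proportional to prior * likelihood
unnormalised <- prior * likelihood
posterior <- unnormalised / sum(unnormalised)
print(round(posterior, 3))
```

With a single predictor, the prior dominates: the posterior for "yes" rises above its prior for a high balance, but "no" remains the more probable class.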
Predict the class of the test data with the predict() function:
loan_pred_class <- predict(object = loan_naive, newdata = dep_test, type = "class")
head(loan_pred_class)
## [1] no no no no no no
## Levels: no yes
Evaluate the model with a confusion matrix:
library(caret)
# confusion matrix
confusionMatrix(data = loan_pred_class, # predicted labels
                reference = dep_test$y, # actual labels
                positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 9219 614
## yes 769 701
##
## Accuracy : 0.8776
## 95% CI : (0.8715, 0.8836)
## No Information Rate : 0.8837
## P-Value [Acc > NIR] : 0.9772
##
## Kappa : 0.4339
##
## Mcnemar's Test P-Value : 3.457e-05
##
## Sensitivity : 0.53308
## Specificity : 0.92301
## Pos Pred Value : 0.47687
## Neg Pred Value : 0.93756
## Prevalence : 0.11634
## Detection Rate : 0.06202
## Detection Prevalence : 0.13005
## Balanced Accuracy : 0.72804
##
## 'Positive' Class : yes
##
confusionMatrix(data = loan_pred_class, # predicted labels
                reference = dep_test$y, # actual labels
                positive = "no")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 9219 614
## yes 769 701
##
## Accuracy : 0.8776
## 95% CI : (0.8715, 0.8836)
## No Information Rate : 0.8837
## P-Value [Acc > NIR] : 0.9772
##
## Kappa : 0.4339
##
## Mcnemar's Test P-Value : 3.457e-05
##
## Sensitivity : 0.9230
## Specificity : 0.5331
## Pos Pred Value : 0.9376
## Neg Pred Value : 0.4769
## Prevalence : 0.8837
## Detection Rate : 0.8156
## Detection Prevalence : 0.8699
## Balanced Accuracy : 0.7280
##
## 'Positive' Class : no
##
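Accuracy is the same whichever class is treated as positive, and the value used from here on is 87.76%. Only sensitivity and specificity swap. Taking "no" as the positive class:
- False negatives (769) outnumber false positives (614).
- True positives (9,219) are still by far the largest group, so the model can be used by telemarketing to find new customers, although its accuracy is slightly below the no-information rate of 88.37%.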
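The effect of the positive class on these metrics can be reproduced by hand from the four counts in the confusion matrix. The sketch below uses base R only. `cm_metrics` is a hypothetical helper written for illustration, not a function from caret.

```r
# Counts copied from the Naive Bayes confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(9219, 769, 614, 701), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))

# Hypothetical helper: headline metrics for a chosen positive class
cm_metrics <- function(cm, positive) {
  negative <- setdiff(colnames(cm), positive)
  tp <- cm[positive, positive]
  fn <- cm[negative, positive]  # actual positive, predicted negative
  fp <- cm[positive, negative]  # actual negative, predicted positive
  tn <- cm[negative, negative]
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

metrics_yes <- cm_metrics(cm, positive = "yes")
metrics_no  <- cm_metrics(cm, positive = "no")
print(round(rbind(yes = metrics_yes, no = metrics_no), 4))
```

Accuracy is identical in both rows, while sensitivity and specificity trade places, which matches the two caret outputs.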
# predict classes on the test data
pred_loan_test <- predict(object = dt_model,
                          newdata = dep_test,
                          type = "response")
# confusion matrix on the test data
confusionMatrix(data = pred_loan_test,
                reference = dep_test$y,
                positive = "no")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 9648 724
## yes 340 591
##
## Accuracy : 0.9059
## 95% CI : (0.9003, 0.9112)
## No Information Rate : 0.8837
## P-Value [Acc > NIR] : 1.922e-14
##
## Kappa : 0.4757
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9660
## Specificity : 0.4494
## Pos Pred Value : 0.9302
## Neg Pred Value : 0.6348
## Prevalence : 0.8837
## Detection Rate : 0.8536
## Detection Prevalence : 0.9176
## Balanced Accuracy : 0.7077
##
## 'Positive' Class : no
##
The Decision Tree model reaches 90.6% accuracy. Taking "no" as the positive class:
- False negatives (340) are fewer than false positives (724).
- In this case, we want as many true positives as possible and as few false negatives as possible. This model gives the highest accuracy, with 96.6% sensitivity.
We evaluated two models. The Decision Tree has the higher accuracy (90.6%) and sensitivity (96.6%), which answers the business need of classifying the customer list. The point of this analysis is to classify who has already taken a loan and who has not. Hopefully, the marketing strategy can use this analysis to maximize sales by targeting the true-positive customers, with confidence in the model's accuracy of about 90%.
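The comparison can be put in one table by recomputing the metrics from the two confusion matrices. This is a minimal sketch in base R with "no" as the positive class. The data frame `results` and its column names are assumptions introduced for illustration.

```r
# Confusion-matrix counts copied from the outputs above, positive class = "no"
results <- data.frame(
  model = c("Naive Bayes", "Decision Tree"),
  tp = c(9219, 9648),  # actual "no",  predicted "no"
  fn = c(769, 340),    # actual "no",  predicted "yes"
  fp = c(614, 724),    # actual "yes", predicted "no"
  tn = c(701, 591)     # actual "yes", predicted "yes"
)

n <- with(results, tp + fn + fp + tn)
results$accuracy          <- with(results, (tp + tn) / n)
results$sensitivity       <- with(results, tp / (tp + fn))
results$specificity       <- with(results, tn / (tn + fp))
results$balanced_accuracy <- with(results, (sensitivity + specificity) / 2)

# No-information rate: accuracy of always predicting the majority class "no"
results$nir       <- with(results, (tp + fn) / n)
results$beats_nir <- results$accuracy > results$nir

print(results[, c("model", "accuracy", "sensitivity", "specificity",
                  "balanced_accuracy", "beats_nir")], digits = 4)
```

Only the Decision Tree exceeds the no-information rate. Its gain comes mainly from the majority "no" class, so its specificity and balanced accuracy are lower than those of Naive Bayes.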