Background

This analysis aims to improve sales numbers for the Bank of Portugal by applying machine learning classification to the bank's raw campaign data. The purpose is to help the telemarketing team reach the right customers: classifying customers based on whether or not they have already taken a loan, so that the Marketing Division can prioritize the customers who do not have a loan yet.

Data Pre-processing

Import libraries

library(dplyr)
library(grid)
library(gtools)
library(e1071)
library(tm)
library(SnowballC)
library(ROCR)
library(partykit)
library(caret)
library(class)
library(gmodels)

Load the data set from the bank in Portugal

bank <- read.csv("bank.csv", sep = ";", stringsAsFactors = T)
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57,…
## $ job       <fct> management, technician, entrepreneur, blue-collar, unknown,…
## $ marital   <fct> married, single, married, married, single, married, single,…
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary,…
## $ default   <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no…
## $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 7…
## $ housing   <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, …
## $ loan      <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, n…
## $ contact   <fct> unknown, unknown, unknown, unknown, unknown, unknown, unkno…
## $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ month     <fct> may, may, may, may, may, may, may, may, may, may, may, may,…
## $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517…
## $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,…
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ poutcome  <fct> unknown, unknown, unknown, unknown, unknown, unknown, unkno…
## $ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,…

Exploratory data analysis

summary(bank)
##       age                 job           marital          education    
##  Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
##  1st Qu.:33.00   management :9458   married :27214   secondary:23202  
##  Median :39.00   technician :7597   single  :12790   tertiary :13301  
##  Mean   :40.94   admin.     :5171                    unknown  : 1857  
##  3rd Qu.:48.00   services   :4154                                     
##  Max.   :95.00   retired    :2264                                     
##                  (Other)    :6835                                     
##  default        balance       housing      loan            contact     
##  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
##  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
##              Median :   448                           unknown  :13020  
##              Mean   :  1362                                            
##              3rd Qu.:  1428                                            
##              Max.   :102127                                            
##                                                                        
##       day            month          duration         campaign     
##  Min.   : 1.00   may    :13766   Min.   :   0.0   Min.   : 1.000  
##  1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
##  Median :16.00   aug    : 6247   Median : 180.0   Median : 2.000  
##  Mean   :15.81   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
##  3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
##  Max.   :31.00   apr    : 2932   Max.   :4918.0   Max.   :63.000  
##                  (Other): 6060                                    
##      pdays          previous           poutcome       y        
##  Min.   : -1.0   Min.   :  0.0000   failure: 4901   no :39922  
##  1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   yes: 5289  
##  Median : -1.0   Median :  0.0000   success: 1511              
##  Mean   : 40.2   Mean   :  0.5803   unknown:36959              
##  3rd Qu.: -1.0   3rd Qu.:  0.0000                              
##  Max.   :871.0   Max.   :275.0000                              
## 
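
As an optional follow-up to summary(), it can help to aggregate a few numeric predictors by the target class. This is a minimal EDA sketch using the already-loaded dplyr; it is an extra illustration rather than a required step:

# average of a few numeric predictors per target class
bank %>%
  group_by(y) %>%
  summarise(n = n(),
            avg_balance = mean(balance),
            avg_duration = mean(duration),
            avg_campaign = mean(campaign))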

Check for missing values

anyNA(bank)
## [1] FALSE
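
Note that anyNA() only detects explicit NA values; in this data set missingness is mostly encoded as the factor level "unknown" (see contact, education, and poutcome in the summary above). A minimal sketch for counting those entries, as an optional extra check:

# count "unknown" entries per column (numeric columns simply return 0)
sapply(bank, function(col) sum(col == "unknown"))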

Check the levels of the target variable

levels(bank$y)
## [1] "no"  "yes"

Cross-Validation (Train-Test Split)

RNGkind(sample.kind = "Rounding")
set.seed(100)

# train-test splitting
index <- sample(nrow(bank), nrow(bank) * 0.75)
dep_train <- bank[index,] # training = 75%
dep_test <- bank[-index,] # testing = 25%

Check the proportions of the target classes:

prop.table(table(dep_train$y))
## 
##        no       yes 
## 0.8828005 0.1171995
prop.table(table(dep_test$y))
## 
##        no       yes 
## 0.8836592 0.1163408

From the proportions above, in both the training and testing sets more than 80% of the customers do not yet have a deposit at the bank.
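
The random split above happens to preserve these class proportions fairly well, but a stratified split guarantees it. A minimal alternative sketch using caret's createDataPartition() (not the split actually used in this analysis; object names are illustrative):

set.seed(100)
strat_index <- createDataPartition(bank$y, p = 0.75, list = FALSE) # stratified by y
strat_train <- bank[strat_index, ]
strat_test  <- bank[-strat_index, ]
prop.table(table(strat_train$y)) # proportions match the full data by construction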

Naive Bayes Model

Model fitting

# train
loan_naive <- naiveBayes(x = dep_train %>% select(-y), # predictors
                          y = dep_train$y) # target
loan_naive$tables$balance
##            balance
## dep_train$y     [,1]     [,2]
##         no  1305.208 2974.823
##         yes 1789.932 3353.306
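
For a numeric predictor such as balance, naiveBayes() stores the class-conditional mean (first column) and standard deviation (second column) and evaluates a Gaussian density at prediction time. A small illustration of that likelihood, assuming an example value of balance = 2000:

params <- loan_naive$tables$balance
dnorm(2000, mean = params["no", 1],  sd = params["no", 2])  # P(balance = 2000 | y = "no")
dnorm(2000, mean = params["yes", 1], sd = params["yes", 2]) # P(balance = 2000 | y = "yes")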

Model Evaluation

Predict the classes of the test data with the predict() function:

loan_pred_class <- predict(object = loan_naive, newdata = dep_test, type = "class")
head(loan_pred_class)
## [1] no no no no no no
## Levels: no yes
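
Besides hard class labels, predict() for a naiveBayes model can also return the posterior probabilities with type = "raw". A short sketch, useful if a ROC curve or a custom threshold is wanted later:

loan_pred_prob <- predict(object = loan_naive, newdata = dep_test, type = "raw")
head(loan_pred_prob) # one row per test observation, columns "no" and "yes"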

Evaluate the model with a confusion matrix:

library(caret)

# confusion matrix
confusionMatrix(data = loan_pred_class, # predicted labels
                reference = dep_test$y, # actual labels
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  9219  614
##        yes  769  701
##                                           
##                Accuracy : 0.8776          
##                  95% CI : (0.8715, 0.8836)
##     No Information Rate : 0.8837          
##     P-Value [Acc > NIR] : 0.9772          
##                                           
##                   Kappa : 0.4339          
##                                           
##  Mcnemar's Test P-Value : 3.457e-05       
##                                           
##             Sensitivity : 0.53308         
##             Specificity : 0.92301         
##          Pos Pred Value : 0.47687         
##          Neg Pred Value : 0.93756         
##              Prevalence : 0.11634         
##          Detection Rate : 0.06202         
##    Detection Prevalence : 0.13005         
##       Balanced Accuracy : 0.72804         
##                                           
##        'Positive' Class : yes             
## 
confusionMatrix(data = loan_pred_class, # predicted labels
                reference = dep_test$y, # actual labels
                positive = "no")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  9219  614
##        yes  769  701
##                                           
##                Accuracy : 0.8776          
##                  95% CI : (0.8715, 0.8836)
##     No Information Rate : 0.8837          
##     P-Value [Acc > NIR] : 0.9772          
##                                           
##                   Kappa : 0.4339          
##                                           
##  Mcnemar's Test P-Value : 3.457e-05       
##                                           
##             Sensitivity : 0.9230          
##             Specificity : 0.5331          
##          Pos Pred Value : 0.9376          
##          Neg Pred Value : 0.4769          
##              Prevalence : 0.8837          
##          Detection Rate : 0.8156          
##    Detection Prevalence : 0.8699          
##       Balanced Accuracy : 0.7280          
##                                           
##        'Positive' Class : no              
## 

The two confusion matrices describe the same predictions, only with the positive class swapped, so the metrics mirror each other; the model reaches 87.76% accuracy. With "no" as the positive class:
- false negatives (769) are higher than false positives (614)
- true positives are still by far the largest count, which means the model's predictions can be used by telemarketing to reach new customers
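
To make these metrics concrete, they can be recomputed directly from the counts in the confusion matrix above:

# TP = 9219, FN = 769, FP = 614, TN = 701 with "no" as the positive class
(9219 + 701) / (9219 + 701 + 769 + 614) # accuracy    ~ 0.8776
9219 / (9219 + 769)                     # sensitivity ~ 0.9230
701 / (701 + 614)                       # specificity ~ 0.5331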

Decision Tree Model

Model fitting

dt_model <- ctree(y~., dep_train)
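
ctree() grows the tree with its default settings here. If a smaller, easier-to-plot tree is preferred, partykit exposes ctree_control(); the values below are only illustrative and were not tuned for this data:

dt_small <- ctree(y ~ ., data = dep_train,
                  control = ctree_control(maxdepth = 3,         # limit tree depth
                                          mincriterion = 0.99)) # stricter split test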

Visualize Decision Tree model

plot(dt_model, type="simple")

Model Evaluation

# predict classes on the test data
pred_loan_test <- predict(object = dt_model, 
                          newdata = dep_test,
                          type = "response")

# confusion matrix on the test data
confusionMatrix(data = pred_loan_test,
                reference = dep_test$y,
                positive = "no")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  9648  724
##        yes  340  591
##                                           
##                Accuracy : 0.9059          
##                  95% CI : (0.9003, 0.9112)
##     No Information Rate : 0.8837          
##     P-Value [Acc > NIR] : 1.922e-14       
##                                           
##                   Kappa : 0.4757          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9660          
##             Specificity : 0.4494          
##          Pos Pred Value : 0.9302          
##          Neg Pred Value : 0.6348          
##              Prevalence : 0.8837          
##          Detection Rate : 0.8536          
##    Detection Prevalence : 0.9176          
##       Balanced Accuracy : 0.7077          
##                                           
##        'Positive' Class : no              
## 

The decision tree model reaches 90.59% accuracy. With "no" as the positive class:
- false negatives (340) are lower than false positives (724)
- for this case we want to capture as many true positives as possible while keeping false negatives to a minimum; this model gives the highest accuracy together with 96.6% sensitivity
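
Since ROCR is already loaded, the two models can also be compared on AUC using their predicted probabilities for the "yes" class. A minimal sketch (object names are illustrative):

nb_prob <- predict(loan_naive, newdata = dep_test, type = "raw")[, "yes"]
dt_prob <- predict(dt_model, newdata = dep_test, type = "prob")[, "yes"]

nb_auc <- performance(prediction(nb_prob, dep_test$y), "auc")@y.values[[1]]
dt_auc <- performance(prediction(dt_prob, dep_test$y), "auc")@y.values[[1]]
c(naive_bayes = nb_auc, decision_tree = dt_auc)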

Summary

We have evaluated two models. In terms of accuracy, the decision tree performs best, with 90.59% accuracy and 96.6% sensitivity, so it answers the business need of classifying the list of customers. The point of this analysis is to classify who has already taken a loan and who has not. Hopefully, the marketing strategy can use this analysis to drive a higher number of sales through the true-positive customers, with the confidence of roughly 90% accuracy from this model.
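
As a final sketch of how the Marketing Division might use the chosen model, the test customers can be scored and sorted so that the most promising ones are contacted first (the added column names below are illustrative):

dep_test %>%
  mutate(pred_class = predict(dt_model, newdata = dep_test, type = "response"),
         prob_yes   = predict(dt_model, newdata = dep_test, type = "prob")[, "yes"]) %>%
  filter(pred_class == "yes") %>% # customers the model flags as likely to convert
  arrange(desc(prob_yes)) %>%     # most confident predictions first
  head()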