Data Set Information:

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.

There are four datasets:
1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with less inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with less inputs).
The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).

Attribute Information:

Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
3 - marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
4 - education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
5 - default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
6 - housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
7 - loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: ‘cellular’,‘telephone’)
9 - month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
10 - day_of_week: last contact day of the week (categorical: ‘mon’,‘tue’,‘wed’,‘thu’,‘fri’)
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

Library Set up and Read Data

# libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(DMwR)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(rsample)
library(ROCR)
library(class)
library(e1071)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.4     ✓ purrr   0.3.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
bank <- read.csv("bank-full.csv", sep = ";")
head(bank)

Data Wrangling

bank <- bank %>% 
  mutate_if(is.character,as.factor)
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57,…
## $ job       <fct> management, technician, entrepreneur, blue-collar, unknown,…
## $ marital   <fct> married, single, married, married, single, married, single,…
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary,…
## $ default   <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no…
## $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 7…
## $ housing   <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, …
## $ loan      <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, n…
## $ contact   <fct> unknown, unknown, unknown, unknown, unknown, unknown, unkno…
## $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ month     <fct> may, may, may, may, may, may, may, may, may, may, may, may,…
## $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517…
## $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,…
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ poutcome  <fct> unknown, unknown, unknown, unknown, unknown, unknown, unkno…
## $ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,…

Check Missing Value

colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0

Your dataset is ready to go!

Check Probability of Target

prop.table(table(bank$y))
## 
##        no       yes 
## 0.8830152 0.1169848

Cross Validation (Split Data Train & Test)

set.seed(1234)

index <- initial_split(bank, 0.80, strata = 'y')
data.train <- training(index)
data.test <- testing(index)

Balancing Target Variable Using SMOTE()

data.train <- SMOTE(y~. , data.train)
prop.table(table(data.train$y))
## 
##        no       yes 
## 0.5714286 0.4285714

Create Model

Naive Bayes

model_naive_bank <- naiveBayes(data.train %>% select(-y),
                           data.train$y)

Preiction of model_naive_bank

pred_class_bank <- predict(model_naive_bank, data.test %>% select(-y))

Model Evaluation

confusionMatrix(pred_class_bank, data.test$y, positive = 'yes')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6158  242
##        yes 1826  815
##                                           
##                Accuracy : 0.7713          
##                  95% CI : (0.7625, 0.7799)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3287          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.77105         
##             Specificity : 0.77129         
##          Pos Pred Value : 0.30860         
##          Neg Pred Value : 0.96219         
##              Prevalence : 0.11691         
##          Detection Rate : 0.09014         
##    Detection Prevalence : 0.29211         
##       Balanced Accuracy : 0.77117         
##                                           
##        'Positive' Class : yes             
## 

ROC & AUC Evaluation

y_pred_prob <- predict(object = model_naive_bank, 
                           newdata = data.test, 
                           type = "raw")

Creating data frame of probability and actual predictions

bank_roc <- data.frame(pred_prob = y_pred_prob[, 'yes'],
           actual = ifelse(data.test$y == 'yes', 1, 0))
head(bank_roc)

Plotting ROC

bank_roc_naive <- prediction(predictions = bank_roc$pred_prob,
                        labels = bank_roc$actual)

# ROC curve
plot(performance(bank_roc_naive, "tpr", "fpr"))

ACU (y.values)

bank_auc <- performance(bank_roc_naive, measure = "auc")
bank_auc@y.values
## [[1]]
## [1] 0.8298038

Naive Bayes Model Interpretation

model_naive_bank
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = data.train %>% select(-y), y = data.train$y)
## 
## A-priori probabilities:
## data.train$y
##        no       yes 
## 0.5714286 0.4285714 
## 
## Conditional probabilities:
##             age
## data.train$y     [,1]     [,2]
##          no  40.73169 10.12745
##          yes 41.52915 11.31746
## 
##             job
## data.train$y      admin. blue-collar entrepreneur   housemaid  management
##          no  0.112712665 0.225602552  0.036862004 0.029064272 0.207939509
##          yes 0.146266541 0.155324512  0.022526780 0.020163831 0.233931947
##             job
## data.train$y     retired self-employed    services     student  technician
##          no  0.041174386   0.033672023 0.094163516 0.018312854 0.167355860
##          yes 0.079552615   0.038752363 0.078764965 0.035444234 0.143588532
##             job
## data.train$y  unemployed     unknown
##          no  0.027173913 0.005966446
##          yes 0.041194077 0.004489603
## 
##             marital
## data.train$y  divorced   married    single
##          no  0.1126536 0.6156073 0.2717391
##          yes 0.1829710 0.4773945 0.3396345
## 
##             education
## data.train$y    primary  secondary   tertiary    unknown
##          no  0.15754962 0.51565454 0.28644849 0.04034735
##          yes 0.12326717 0.45061437 0.36980151 0.05631695
## 
##             default
## data.train$y         no        yes
##          no  0.98003308 0.01996692
##          yes 0.92107750 0.07892250
## 
##             balance
## data.train$y     [,1]     [,2]
##          no  1318.758 3200.056
##          yes 1518.115 2797.331
## 
##             housing
## data.train$y        no       yes
##          no  0.4181829 0.5818171
##          yes 0.5360744 0.4639256
## 
##             loan
## data.train$y        no       yes
##          no  0.8290997 0.1709003
##          yes 0.6724165 0.3275835
## 
##             contact
## data.train$y   cellular  telephone    unknown
##          no  0.62062854 0.06149575 0.31787571
##          yes 0.66052300 0.13516068 0.20431632
## 
##             day
## data.train$y     [,1]     [,2]
##          no  15.85462 8.284512
##          yes 14.99764 7.992567
## 
##             month
## data.train$y         apr         aug         dec         feb         jan
##          no  0.057065217 0.136401229 0.003189981 0.054938563 0.028768904
##          yes 0.097668557 0.101843100 0.023708255 0.072936358 0.021502836
##             month
## data.train$y         jul         jun         mar         may         nov
##          no  0.162157372 0.121337429 0.006261815 0.323605860 0.087901701
##          yes 0.121061752 0.137287335 0.040327662 0.198487713 0.075456837
##             month
## data.train$y         oct         sep
##          no  0.011046786 0.007325142
##          yes 0.064508507 0.045211090
## 
##             duration
## data.train$y     [,1]     [,2]
##          no  222.8318 213.3509
##          yes 564.5240 347.3338
## 
##             campaign
## data.train$y     [,1]     [,2]
##          no  2.816871 3.200866
##          yes 2.118878 1.594061
## 
##             pdays
## data.train$y     [,1]      [,2]
##          no  36.09204  96.76664
##          yes 77.34469 105.21821
## 
##             previous
## data.train$y      [,1]     [,2]
##          no  0.5098062 3.421808
##          yes 1.1417570 1.990264
## 
##             poutcome
## data.train$y    failure      other    success    unknown
##          no  0.10698251 0.03733459 0.01299622 0.84268667
##          yes 0.14366730 0.06175173 0.17926906 0.61531191

We can give our interpretation for every feature by the result of their own dependent probability with the target, for example :

Probability of customer that have loan to make a purchase during the campaign are 0.32

Decision Tree Model

bank_dtree <- ctree(formula = y ~ ., data.train, control = ctree_control(minbucket = 50))

Prediction

bank_predict_test <- predict(bank_dtree, data.test, type = 'response')

Model Evaluation

confusionMatrix(bank_predict_test,data.test$y, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6920  249
##        yes 1064  808
##                                          
##                Accuracy : 0.8548         
##                  95% CI : (0.8473, 0.862)
##     No Information Rate : 0.8831         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.473          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.76443        
##             Specificity : 0.86673        
##          Pos Pred Value : 0.43162        
##          Neg Pred Value : 0.96527        
##              Prevalence : 0.11691        
##          Detection Rate : 0.08937        
##    Detection Prevalence : 0.20706        
##       Balanced Accuracy : 0.81558        
##                                          
##        'Positive' Class : yes            
## 

ROC

y_pred_prob_dtree <- predict(object = bank_dtree, 
                           newdata = data.test, 
                           type = "prob")
data_roc_dtree <- data.frame(pred_prob = y_pred_prob_dtree[, 'yes'],
           actual = ifelse(data.test$y == 'yes', 1, 0))
head(data_roc_dtree)

ROC Plotting

bank_roc_dtree <- prediction(predictions = data_roc_dtree$pred_prob,
                        labels = data_roc_dtree$actual)

plot(performance(bank_roc_dtree, "tpr", "fpr"))

AUC

bank_auc_dtree <- performance(bank_roc_dtree, measure = "auc")
bank_auc_dtree@y.values
## [[1]]
## [1] 0.900137

Random Forest Model

Removing Collumns That have Variance near to zero

n0var <- nearZeroVar(bank)
bank_n0 <- bank[,-n0var]
bank_n0 <- bank_n0 %>% mutate_if(is.character, as.factor)

Create Random Forest Model

#set.seed(4204)
 
#ctrl <- trainControl(method = "repeatedcv",
            #          number = 5,
             #         repeats = 4)
 
#bank_rforest <- train(y ~ .,
              #         data = bank_n0,
               #        method = "rf",
                #       trControl = ctrl,
                 #      tuneLength = 5) #membuat 5 kombinasi mtry
#bank_rforest
 
#saveRDS(bank_rforest, 'bank_rforest.RDS')

Import Model

why we import model rforest that we made before? because it take too long of computation process. So we can save the model into .RDS.

bank_rforest <- readRDS('bank_rforest.RDS')
bank_rforest
## Random Forest 
## 
## 45211 samples
##    14 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 4 times) 
## Summary of sample sizes: 36168, 36169, 36170, 36169, 36168, 36168, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8927639  0.1675162
##   11    0.9060184  0.4758433
##   21    0.9046526  0.4795149
##   30    0.9040222  0.4792886
##   40    0.9035411  0.4796156
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.

Prediction

bank_rforest_pred <- predict(bank_rforest, data.test)

Model Random Forest Evaluation

confusionMatrix(bank_rforest_pred, data.test$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7984    0
##        yes    0 1057
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9996, 1)
##     No Information Rate : 0.8831     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.8831     
##          Detection Rate : 0.8831     
##    Detection Prevalence : 0.8831     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : no         
## 

ROC & AUC Random Forest

y_pred_prob_rf <- predict(object = bank_rforest, 
                           newdata = data.train, 
                           type = "prob")
bank_roc_rf <- data.frame(pred_prob = y_pred_prob_rf[, 'yes'],
           actual = ifelse(data.train$y == 'yes', 1, 0))
head(bank_roc_rf)

Plotting ROC

bank_roc_rf <- prediction(predictions = bank_roc_rf$pred_prob,
                        labels = bank_roc_rf$actual)

# ROC curve
plot(performance(bank_roc_rf, "tpr", "fpr"))

AUC

bank_auc_rf <- performance(bank_roc_rf, measure = "auc")
bank_auc_rf@y.values
## [[1]]
## [1] 0.9891592

Random Forest Model Interpretation

varImp(bank_rforest)
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                    Overall
## duration           100.000
## balance             36.299
## age                 33.958
## day                 30.103
## poutcomesuccess     23.268
## campaign            12.462
## previous             8.611
## housingyes           6.957
## monthmar             4.520
## contactunknown       4.332
## educationsecondary   3.952
## monthjun             3.881
## maritalmarried       3.791
## educationtertiary    3.717
## monthoct             3.500
## jobtechnician        3.442
## poutcomeunknown      3.409
## jobmanagement        3.407
## monthmay             3.288
## monthaug             3.207