Introduction

Marketing of bank products is the aggregate function absorbed at providing facility to satisfy customer’s monetary needs and wants, more than the rivalry keeping in view the organizational objectives. Banking is a personalized service oriented industry and hence should provide services which satisfy the customers’ needs. The marketing tactic includes forestalling, classifying, responding and satisfying the customers’ needs and wants effectually, professionally, and beneficially. [Mahtab N, Abdullah M (2016) Marketing of Financial and Banking Products: An Example from Bangladeshi Bank. J Account Mark 5:159. doi: 10.4172/2168-9601.1000159]

Business Question

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.

What to do with this Data

As already mention in the previous section, from the data we receive from

Source: Bank Marketing Data Set - UCI Machine Learning

we will try to classified whether the client is going to subscribed or not the banking product that delivered by marketing team using the variables/parameter mention in this data

Calling Function We Might use

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble  3.0.4     v purrr   0.3.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(rsample)
library("inspectdf")
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Read Data

bank <- read.csv("bank-additional-full.csv", sep = ";")

head(bank)

Column Description :

age : Describing all client ages

job : type of job

marital : marital status of client such a single, married, divorce and ect.

education : Client level of education

default: default status

housing: Is the client has housing loan?

loan: Is the Client has personal loan ?

contact: contact communication type

month: last contact month of year

day_of_week: last contact day of the week

duration: last contact duration

campaign: number of contacts performed during this campaign and for this client

pdays: number of days that passed by after the client was last contacted from a previous campaign

previous: number of contacts performed before this campaign and for this client

poutcome: outcome of the previous marketing campaign

emp.var.rate: Employment rates are defined as a measure of the extent to which available labour resources (people available to work) are being used. They are calculated as the ratio of the employed to the working age population. - quarterly indicator

cons.price.idx: the consumer price index measures the pure price change in a selected basket of goods and services (of constant quantity and quality) typically purchased - monthly indicator

cons.conf.idx: This consumer confidence indicator provides an indication of future developments of households’ consumption and saving, based upon answers regarding their expected financial situation, their sentiment about the general economic situation, unemployment and capability of savings. An indicator above 100 signals a boost in the consumers’ confidence towards the future economic situation, as a consequence of which they are less prone to save, and more inclined to spend money on major purchases in the next 12 months. - monthly indicator

euribor3m: Euribor is short for Euro Interbank Offered Rate. In this case euribor is for 3 month rate - daily indicator

nr.employed: number of employees - quarterly indicator

Class/y: The product is being subscribe or not by the Client

Data Wrangling

Checking type of data on Each Variables

glimpse(bank)
## Rows: 41,188
## Columns: 21
## $ age            <int> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, ...
## $ job            <chr> "housemaid", "services", "services", "admin.", "serv...
## $ marital        <chr> "married", "married", "married", "married", "married...
## $ education      <chr> "basic.4y", "high.school", "high.school", "basic.6y"...
## $ default        <chr> "no", "unknown", "no", "no", "no", "unknown", "no", ...
## $ housing        <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "ye...
## $ loan           <chr> "no", "no", "no", "no", "yes", "no", "no", "no", "no...
## $ contact        <chr> "telephone", "telephone", "telephone", "telephone", ...
## $ month          <chr> "may", "may", "may", "may", "may", "may", "may", "ma...
## $ day_of_week    <chr> "mon", "mon", "mon", "mon", "mon", "mon", "mon", "mo...
## $ duration       <int> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55,...
## $ campaign       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ pdays          <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 99...
## $ previous       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ poutcome       <chr> "nonexistent", "nonexistent", "nonexistent", "nonexi...
## $ emp.var.rate   <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1....
## $ cons.price.idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.9...
## $ cons.conf.idx  <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36...
## $ euribor3m      <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.8...
## $ nr.employed    <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191...
## $ y              <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no"...

Based on the summary above we found that the variable with character type of data is better to be transform into factor therefore we decided to transform it first before continue to the next step

bank <- bank %>% 
      mutate_if(is.character, as.factor)

glimpse(bank)
## Rows: 41,188
## Columns: 21
## $ age            <int> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, ...
## $ job            <fct> housemaid, services, services, admin., services, ser...
## $ marital        <fct> married, married, married, married, married, married...
## $ education      <fct> basic.4y, high.school, high.school, basic.6y, high.s...
## $ default        <fct> no, unknown, no, no, no, unknown, no, unknown, no, n...
## $ housing        <fct> no, no, yes, no, no, no, no, no, yes, yes, no, yes, ...
## $ loan           <fct> no, no, no, no, yes, no, no, no, no, no, no, no, yes...
## $ contact        <fct> telephone, telephone, telephone, telephone, telephon...
## $ month          <fct> may, may, may, may, may, may, may, may, may, may, ma...
## $ day_of_week    <fct> mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, mo...
## $ duration       <int> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55,...
## $ campaign       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ pdays          <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 99...
## $ previous       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ poutcome       <fct> nonexistent, nonexistent, nonexistent, nonexistent, ...
## $ emp.var.rate   <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1....
## $ cons.price.idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.9...
## $ cons.conf.idx  <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36...
## $ euribor3m      <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.8...
## $ nr.employed    <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191...
## $ y              <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, ...

Checking NA

We will see is there any NA value within this data

colSums(is.na(bank))
##            age            job        marital      education        default 
##              0              0              0              0              0 
##        housing           loan        contact          month    day_of_week 
##              0              0              0              0              0 
##       duration       campaign          pdays       previous       poutcome 
##              0              0              0              0              0 
##   emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
##              0              0              0              0              0 
##              y 
##              0

What a luck, we did not find any missing value within our data so we can continue to the next step

EDA

Corelation test

First we will try to find out the correlation between

GGally::ggcorr(bank %>% select_if(is.numeric), label = T)

Based on the above graphic, we find out that there is one variable that having high correlation with more than one variable name emp.var.rate. therefore we try to eliminate this variable before we continue to modelling process.

bank <- bank %>% 
        select(-emp.var.rate)

head(bank)

Check Outliers

boxplot(scale(bank %>% select_if(is.numeric)))

From the above graphic we see that there are several variable that showing outliers, nonetheless we decide to keep using all of data within this data by considering the importance of the outliers which might give signifcicant result of our model.

Check Data Proportion

# check sample

prop.table(table(bank$y))
## 
##        no       yes 
## 0.8873458 0.1126542

From the above result we find out that data proportion for column “y” might categorized as unbalance data proportion. therefore we decide to up sampling the data to create balance proportion.

Cross Validation

# cross validation
set.seed(100)
spliter <- initial_split(bank, 0.8, strata = "y")

data_train <- training(spliter)
data_test <- testing (spliter)

data_test <- data_test %>% 
             rename(Class = y)

Upsammpling data

data_train_up <- upSample(data_train %>% select(-y), data_train$y)

Check Data Level

levels(data_train_up$Class)
## [1] "no"  "yes"
levels(data_test$Class)
## [1] "no"  "yes"

Re-check Data Proportion

prop.table(table(data_train_up$Class))
## 
##  no yes 
## 0.5 0.5

The Data now already balance and ready to be modeling

Modelling : Dtree

model_dt <- ctree(Class ~ . , data_train_up)

Additional step

We have to inspect the data levels in data train and data test to prevent unmatching data which will cause the model cannot run

inspect_cat(data_train_up)$levels
## $Class
## # A tibble: 2 x 3
##   value  prop   cnt
##   <chr> <dbl> <int>
## 1 no      0.5 29239
## 2 yes     0.5 29239
## 
## $contact
## # A tibble: 2 x 3
##   value      prop   cnt
##   <chr>     <dbl> <int>
## 1 cellular  0.719 42059
## 2 telephone 0.281 16419
## 
## $day_of_week
## # A tibble: 5 x 3
##   value  prop   cnt
##   <chr> <dbl> <int>
## 1 thu   0.218 12769
## 2 wed   0.201 11732
## 3 tue   0.199 11664
## 4 mon   0.195 11427
## 5 fri   0.186 10886
## 
## $default
## # A tibble: 2 x 3
##   value    prop   cnt
##   <chr>   <dbl> <int>
## 1 no      0.842 49228
## 2 unknown 0.158  9250
## 
## $education
## # A tibble: 8 x 3
##   value                   prop   cnt
##   <chr>                  <dbl> <int>
## 1 university.degree   0.322    18836
## 2 high.school         0.229    13384
## 3 professional.course 0.128     7498
## 4 basic.9y            0.128     7460
## 5 basic.4y            0.0973    5688
## 6 unknown             0.0488    2852
## 7 basic.6y            0.0468    2736
## 8 illiterate          0.000410    24
## 
## $housing
## # A tibble: 3 x 3
##   value     prop   cnt
##   <chr>    <dbl> <int>
## 1 yes     0.533  31187
## 2 no      0.443  25898
## 3 unknown 0.0238  1393
## 
## $job
## # A tibble: 12 x 3
##    value            prop   cnt
##    <chr>           <dbl> <int>
##  1 admin.        0.271   15819
##  2 blue-collar   0.186   10891
##  3 technician    0.164    9613
##  4 services      0.0854   4993
##  5 management    0.0694   4060
##  6 retired       0.0618   3615
##  7 student       0.0381   2228
##  8 self-employed 0.0325   1898
##  9 entrepreneur  0.0315   1841
## 10 unemployed    0.0268   1568
## 11 housemaid     0.0258   1509
## 12 unknown       0.00758   443
## 
## $loan
## # A tibble: 3 x 3
##   value     prop   cnt
##   <chr>    <dbl> <int>
## 1 no      0.828  48418
## 2 yes     0.148   8667
## 3 unknown 0.0238  1393
## 
## $marital
## # A tibble: 4 x 3
##   value       prop   cnt
##   <chr>      <dbl> <int>
## 1 married  0.577   33766
## 2 single   0.314   18357
## 3 divorced 0.106    6211
## 4 unknown  0.00246   144
## 
## $month
## # A tibble: 10 x 3
##    value    prop   cnt
##    <chr>   <dbl> <int>
##  1 may   0.272   15877
##  2 jul   0.159    9323
##  3 aug   0.143    8361
##  4 jun   0.126    7366
##  5 nov   0.0960   5614
##  6 apr   0.0870   5087
##  7 oct   0.0404   2361
##  8 mar   0.0337   1971
##  9 sep   0.0335   1961
## 10 dec   0.00952   557
## 
## $poutcome
## # A tibble: 3 x 3
##   value        prop   cnt
##   <chr>       <dbl> <int>
## 1 nonexistent 0.781 45653
## 2 failure     0.115  6740
## 3 success     0.104  6085
inspect_cat(data_test)$levels
## $Class
## # A tibble: 2 x 3
##   value  prop   cnt
##   <chr> <dbl> <int>
## 1 no    0.887  7309
## 2 yes   0.113   927
## 
## $contact
## # A tibble: 2 x 3
##   value      prop   cnt
##   <chr>     <dbl> <int>
## 1 cellular  0.642  5285
## 2 telephone 0.358  2951
## 
## $day_of_week
## # A tibble: 5 x 3
##   value  prop   cnt
##   <chr> <dbl> <int>
## 1 mon   0.214  1765
## 2 thu   0.211  1736
## 3 wed   0.198  1633
## 4 tue   0.190  1561
## 5 fri   0.187  1541
## 
## $default
## # A tibble: 3 x 3
##   value       prop   cnt
##   <chr>      <dbl> <int>
## 1 no      0.789     6500
## 2 unknown 0.210     1733
## 3 yes     0.000364     3
## 
## $education
## # A tibble: 8 x 3
##   value                   prop   cnt
##   <chr>                  <dbl> <int>
## 1 university.degree   0.295     2432
## 2 high.school         0.229     1884
## 3 basic.9y            0.145     1192
## 4 professional.course 0.129     1063
## 5 basic.4y            0.103      849
## 6 basic.6y            0.0583     480
## 7 unknown             0.0402     331
## 8 illiterate          0.000607     5
## 
## $housing
## # A tibble: 3 x 3
##   value     prop   cnt
##   <chr>    <dbl> <int>
## 1 yes     0.523   4305
## 2 no      0.453   3733
## 3 unknown 0.0240   198
## 
## $job
## # A tibble: 12 x 3
##    value            prop   cnt
##    <chr>           <dbl> <int>
##  1 admin.        0.250    2063
##  2 blue-collar   0.231    1901
##  3 technician    0.159    1311
##  4 services      0.0963    793
##  5 management    0.0714    588
##  6 retired       0.0429    353
##  7 entrepreneur  0.0355    292
##  8 self-employed 0.0355    292
##  9 unemployed    0.0253    208
## 10 housemaid     0.0239    197
## 11 student       0.0212    175
## 12 unknown       0.00765    63
## 
## $loan
## # A tibble: 3 x 3
##   value     prop   cnt
##   <chr>    <dbl> <int>
## 1 no      0.818   6741
## 2 yes     0.157   1297
## 3 unknown 0.0240   198
## 
## $marital
## # A tibble: 4 x 3
##   value       prop   cnt
##   <chr>      <dbl> <int>
## 1 married  0.602    4954
## 2 single   0.283    2332
## 3 divorced 0.114     935
## 4 unknown  0.00182    15
## 
## $month
## # A tibble: 10 x 3
##    value    prop   cnt
##    <chr>   <dbl> <int>
##  1 may   0.330    2714
##  2 jul   0.176    1449
##  3 aug   0.156    1281
##  4 jun   0.129    1059
##  5 nov   0.102     836
##  6 apr   0.0603    497
##  7 oct   0.0164    135
##  8 mar   0.0141    116
##  9 sep   0.0121    100
## 10 dec   0.00595    49
## 
## $poutcome
## # A tibble: 3 x 3
##   value         prop   cnt
##   <chr>        <dbl> <int>
## 1 nonexistent 0.865   7122
## 2 failure     0.103    848
## 3 success     0.0323   266

As we found in variable “default” there were different level between data_train and data_test then we will eliminate the level different as below

data_test <- data_test %>% 
             filter(!default == "yes") %>% 
             droplevels()

Evaluasi model

pred_dt <- predict(model_dt, data_test)

Confussion matrix

confusionMatrix(pred_dt, data_test$Class, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6374  158
##        yes  932  769
##                                           
##                Accuracy : 0.8676          
##                  95% CI : (0.8601, 0.8749)
##     No Information Rate : 0.8874          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.5145          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8296          
##             Specificity : 0.8724          
##          Pos Pred Value : 0.4521          
##          Neg Pred Value : 0.9758          
##              Prevalence : 0.1126          
##          Detection Rate : 0.0934          
##    Detection Prevalence : 0.2066          
##       Balanced Accuracy : 0.8510          
##                                           
##        'Positive' Class : yes             
## 

From the confusion matrix above we see that the model is already showing good accuracy at about 86 %, eventhough the accuracy already showing good enough result we may make the model perform better by Pre Prunning tree method for adjusting this three parameter below.

mincriterion: The value of the test statistic (1 - p-value) that must be exceeded in order to implement a split. For example, when mincriterion is 0.95, the p-value must be smaller than 0.05 in order for a node to split. This can also act as a “regulator” for the depth of the tree. The higher the mincriterion, the harder it is to perform splitting, thus generate a smaller tree.

minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted. Default to 20.

minbucket: the minimum number of observations in any leaf node. Default to round(minsplit/3). maxdepth: Set the maximum depth of any node of the final tree, with the root node counted as depth 0. Default to 30.

In this case to compare the goodness of our model we just want to compare with another model which is Naive Bias

Model Naive Bias

There are certain characteristics of Naive Bayes that should be considered:

assumes that all features of the dataset are equally important and independent. This allows Naive Bayes to perform faster computation (the algorithms is quite simple).

prone to bias due to data scarcity. In some cases, our data may have a distribution where scarce observations lead to probabilities approximating close to 0 or 1, which introduces a heavy bias into our model that could lead to poor performance on unseen data.

more appropriate for data with categoric predictors. This is because Naive Bayes is sensitive to data scarcity. Meanwhile, a continuous variable might contain really scarce or even only one observation for certain value.

apply Laplace estimator/smoothing for data scarcity problem. Laplace estimator proposes the adding of a small number (usually 1) to each of the counts in the frequency table. This subsequently ensures that each class-feature combination has a non-zero probability of occurring.

Based on the data Characteristic our data is actually not very appropriate for Naive Bayes. From the data description, some of our predictors have a high correlation with one another. Eventough, we are still going to try using Naive Bayes and the result will be compared with the other models. While building our Naive Bayes model, we should also apply Laplace estimator.

library(e1071)

model_naive <- naiveBayes(x = data_train_up %>% select(-Class),
                          y = data_train_up$Class,
                          laplace = 1)

Prediction

naive_class <- predict(model_naive, data_test, type = "class")

head(naive_class)
## [1] no no no no no no
## Levels: no yes

Model Evaluation

library(caret)

confusionMatrix(naive_class, data_test$Class, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6247  264
##        yes 1059  663
##                                           
##                Accuracy : 0.8393          
##                  95% CI : (0.8312, 0.8472)
##     No Information Rate : 0.8874          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4149          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.71521         
##             Specificity : 0.85505         
##          Pos Pred Value : 0.38502         
##          Neg Pred Value : 0.95945         
##              Prevalence : 0.11260         
##          Detection Rate : 0.08053         
##    Detection Prevalence : 0.20916         
##       Balanced Accuracy : 0.78513         
##                                           
##        'Positive' Class : yes             
## 

Conslusion

Based on the metrics table above, the predictive model built using Decision Tree algorithm gave the best result. The model gave highest accuracy 86 % . Therefore the best model to predict subscription of client quality based on personal data of each client is decision tree.