Marketing of bank products is the aggregate of functions directed at providing services to satisfy customers' financial needs and wants, more effectively than the competition, while keeping the organizational objectives in view. Banking is a personalized, service-oriented industry and hence should provide services that satisfy customers' needs. The marketing approach includes anticipating, identifying, responding to, and satisfying customers' needs and wants effectively, efficiently, and profitably. [Mahtab N, Abdullah M (2016) Marketing of Financial and Banking Products: An Example from Bangladeshi Bank. J Account Mark 5:159. doi: 10.4172/2168-9601.1000159]
The data relate to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
As mentioned in the previous section, we use the data obtained from

Source: Bank Marketing Data Set - UCI Machine Learning Repository

and will try to classify whether a client will subscribe to the banking product offered by the marketing team, using the variables/parameters described in this data.
library(dplyr)
library(caret)
library(tidyverse)
library(partykit)
library(rsample)
library("inspectdf")
library(GGally)bank <- read.csv("bank-additional-full.csv", sep = ";")
head(bank)

Column Description:
age: the client's age
job : type of job
marital: marital status of the client, such as single, married, divorced, etc.
education: the client's level of education
default: does the client have credit in default?
housing: does the client have a housing loan?
loan: does the client have a personal loan?
contact: contact communication type
month: last contact month of year
day_of_week: last contact day of the week
duration: last contact duration, in seconds
campaign: number of contacts performed during this campaign and for this client
pdays: number of days that passed after the client was last contacted during a previous campaign (999 means the client was not previously contacted)
previous: number of contacts performed before this campaign and for this client
poutcome: outcome of the previous marketing campaign
emp.var.rate: employment variation rate. Employment rates measure the extent to which available labour resources (people available to work) are being used, calculated as the ratio of the employed to the working-age population. - quarterly indicator
cons.price.idx: the consumer price index measures the pure price change in a selected basket of goods and services (of constant quantity and quality) typically purchased - monthly indicator
cons.conf.idx: This consumer confidence indicator provides an indication of future developments of households’ consumption and saving, based upon answers regarding their expected financial situation, their sentiment about the general economic situation, unemployment and capability of savings. An indicator above 100 signals a boost in the consumers’ confidence towards the future economic situation, as a consequence of which they are less prone to save, and more inclined to spend money on major purchases in the next 12 months. - monthly indicator
euribor3m: Euribor is short for Euro Interbank Offered Rate; in this case, the 3-month rate. - daily indicator
nr.employed: number of employees - quarterly indicator
y (Class): whether or not the client subscribed to the product
glimpse(bank)

## Rows: 41,188
## Columns: 21
## $ age <int> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, ...
## $ job <chr> "housemaid", "services", "services", "admin.", "serv...
## $ marital <chr> "married", "married", "married", "married", "married...
## $ education <chr> "basic.4y", "high.school", "high.school", "basic.6y"...
## $ default <chr> "no", "unknown", "no", "no", "no", "unknown", "no", ...
## $ housing <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "ye...
## $ loan <chr> "no", "no", "no", "no", "yes", "no", "no", "no", "no...
## $ contact <chr> "telephone", "telephone", "telephone", "telephone", ...
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "ma...
## $ day_of_week <chr> "mon", "mon", "mon", "mon", "mon", "mon", "mon", "mo...
## $ duration <int> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55,...
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ pdays <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 99...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ poutcome <chr> "nonexistent", "nonexistent", "nonexistent", "nonexi...
## $ emp.var.rate <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1....
## $ cons.price.idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.9...
## $ cons.conf.idx <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36...
## $ euribor3m <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.8...
## $ nr.employed <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191...
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no"...
Based on the summary above, the character-type variables are better represented as factors, so we transform them before continuing to the next step.
bank <- bank %>%
mutate_if(is.character, as.factor)
glimpse(bank)

## Rows: 41,188
## Columns: 21
## $ age <int> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, ...
## $ job <fct> housemaid, services, services, admin., services, ser...
## $ marital <fct> married, married, married, married, married, married...
## $ education <fct> basic.4y, high.school, high.school, basic.6y, high.s...
## $ default <fct> no, unknown, no, no, no, unknown, no, unknown, no, n...
## $ housing <fct> no, no, yes, no, no, no, no, no, yes, yes, no, yes, ...
## $ loan <fct> no, no, no, no, yes, no, no, no, no, no, no, no, yes...
## $ contact <fct> telephone, telephone, telephone, telephone, telephon...
## $ month <fct> may, may, may, may, may, may, may, may, may, may, ma...
## $ day_of_week <fct> mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, mo...
## $ duration <int> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55,...
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ pdays <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 99...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ poutcome <fct> nonexistent, nonexistent, nonexistent, nonexistent, ...
## $ emp.var.rate <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1....
## $ cons.price.idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.9...
## $ cons.conf.idx <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36...
## $ euribor3m <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.8...
## $ nr.employed <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191...
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, ...
Next, we check whether there are any NA values in the data.
colSums(is.na(bank))

## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y
## 0
Fortunately, there are no missing values in the data, so we can continue to the next step.
First, we look at the correlations between the numeric variables.

GGally::ggcorr(bank %>% select_if(is.numeric), label = T)

Based on the graphic above, one variable, emp.var.rate, has a high correlation with more than one other variable, so we eliminate it before continuing to the modelling process.
bank <- bank %>%
select(-emp.var.rate)
head(bank)

boxplot(scale(bank %>% select_if(is.numeric)))

From the boxplot above, several variables show outliers. Nonetheless, we decide to keep all of the data, considering that the outliers may carry information that has a significant effect on our model.
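As a rough check (an illustrative sketch, not part of the original analysis), the usual 1.5 x IQR rule can count how many observations fall outside the boxplot whiskers for each numeric variable:

# count values outside the 1.5 * IQR whiskers for each numeric column
sapply(bank %>% select_if(is.numeric), function(x) {
  q <- quantile(x, c(0.25, 0.75))
  step <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - step | x > q[2] + step)
})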
# check sample
prop.table(table(bank$y))

##
## no yes
## 0.8873458 0.1126542
From the result above, the class proportion of the target column "y" is clearly imbalanced, so we decide to upsample the training data to create a balanced proportion.
# cross validation
set.seed(100)
spliter <- initial_split(bank, 0.8, strata = "y")
data_train <- training(spliter)
data_test <- testing(spliter)

data_test <- data_test %>%
  rename(Class = y)

data_train_up <- upSample(data_train %>% select(-y), data_train$y)

levels(data_train_up$Class)

## [1] "no" "yes"
levels(data_test$Class)

## [1] "no" "yes"
prop.table(table(data_train_up$Class))

##
## no yes
## 0.5 0.5
The data are now balanced and ready for modelling.
model_dt <- ctree(Class ~ ., data_train_up)

We also have to inspect the factor levels in the training and test data, to prevent level mismatches that would stop the model from producing predictions.
inspect_cat(data_train_up)$levels

## $Class
## # A tibble: 2 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 no 0.5 29239
## 2 yes 0.5 29239
##
## $contact
## # A tibble: 2 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 cellular 0.719 42059
## 2 telephone 0.281 16419
##
## $day_of_week
## # A tibble: 5 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 thu 0.218 12769
## 2 wed 0.201 11732
## 3 tue 0.199 11664
## 4 mon 0.195 11427
## 5 fri 0.186 10886
##
## $default
## # A tibble: 2 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 no 0.842 49228
## 2 unknown 0.158 9250
##
## $education
## # A tibble: 8 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 university.degree 0.322 18836
## 2 high.school 0.229 13384
## 3 professional.course 0.128 7498
## 4 basic.9y 0.128 7460
## 5 basic.4y 0.0973 5688
## 6 unknown 0.0488 2852
## 7 basic.6y 0.0468 2736
## 8 illiterate 0.000410 24
##
## $housing
## # A tibble: 3 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 yes 0.533 31187
## 2 no 0.443 25898
## 3 unknown 0.0238 1393
##
## $job
## # A tibble: 12 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 admin. 0.271 15819
## 2 blue-collar 0.186 10891
## 3 technician 0.164 9613
## 4 services 0.0854 4993
## 5 management 0.0694 4060
## 6 retired 0.0618 3615
## 7 student 0.0381 2228
## 8 self-employed 0.0325 1898
## 9 entrepreneur 0.0315 1841
## 10 unemployed 0.0268 1568
## 11 housemaid 0.0258 1509
## 12 unknown 0.00758 443
##
## $loan
## # A tibble: 3 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 no 0.828 48418
## 2 yes 0.148 8667
## 3 unknown 0.0238 1393
##
## $marital
## # A tibble: 4 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 married 0.577 33766
## 2 single 0.314 18357
## 3 divorced 0.106 6211
## 4 unknown 0.00246 144
##
## $month
## # A tibble: 10 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 may 0.272 15877
## 2 jul 0.159 9323
## 3 aug 0.143 8361
## 4 jun 0.126 7366
## 5 nov 0.0960 5614
## 6 apr 0.0870 5087
## 7 oct 0.0404 2361
## 8 mar 0.0337 1971
## 9 sep 0.0335 1961
## 10 dec 0.00952 557
##
## $poutcome
## # A tibble: 3 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 nonexistent 0.781 45653
## 2 failure 0.115 6740
## 3 success 0.104 6085
inspect_cat(data_test)$levels

## $Class
## # A tibble: 2 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 no 0.887 7309
## 2 yes 0.113 927
##
## $contact
## # A tibble: 2 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 cellular 0.642 5285
## 2 telephone 0.358 2951
##
## $day_of_week
## # A tibble: 5 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 mon 0.214 1765
## 2 thu 0.211 1736
## 3 wed 0.198 1633
## 4 tue 0.190 1561
## 5 fri 0.187 1541
##
## $default
## # A tibble: 3 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 no 0.789 6500
## 2 unknown 0.210 1733
## 3 yes 0.000364 3
##
## $education
## # A tibble: 8 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 university.degree 0.295 2432
## 2 high.school 0.229 1884
## 3 basic.9y 0.145 1192
## 4 professional.course 0.129 1063
## 5 basic.4y 0.103 849
## 6 basic.6y 0.0583 480
## 7 unknown 0.0402 331
## 8 illiterate 0.000607 5
##
## $housing
## # A tibble: 3 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 yes 0.523 4305
## 2 no 0.453 3733
## 3 unknown 0.0240 198
##
## $job
## # A tibble: 12 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 admin. 0.250 2063
## 2 blue-collar 0.231 1901
## 3 technician 0.159 1311
## 4 services 0.0963 793
## 5 management 0.0714 588
## 6 retired 0.0429 353
## 7 entrepreneur 0.0355 292
## 8 self-employed 0.0355 292
## 9 unemployed 0.0253 208
## 10 housemaid 0.0239 197
## 11 student 0.0212 175
## 12 unknown 0.00765 63
##
## $loan
## # A tibble: 3 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 no 0.818 6741
## 2 yes 0.157 1297
## 3 unknown 0.0240 198
##
## $marital
## # A tibble: 4 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 married 0.602 4954
## 2 single 0.283 2332
## 3 divorced 0.114 935
## 4 unknown 0.00182 15
##
## $month
## # A tibble: 10 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 may 0.330 2714
## 2 jul 0.176 1449
## 3 aug 0.156 1281
## 4 jun 0.129 1059
## 5 nov 0.102 836
## 6 apr 0.0603 497
## 7 oct 0.0164 135
## 8 mar 0.0141 116
## 9 sep 0.0121 100
## 10 dec 0.00595 49
##
## $poutcome
## # A tibble: 3 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 nonexistent 0.865 7122
## 2 failure 0.103 848
## 3 success 0.0323 266
As shown above, the variable "default" has an extra level ("yes") in data_test that does not appear in the training data, so we remove the differing level as below.
data_test <- data_test %>%
  filter(default != "yes") %>%
  droplevels()

pred_dt <- predict(model_dt, data_test)

confusionMatrix(pred_dt, data_test$Class, positive = "yes")

## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6374 158
## yes 932 769
##
## Accuracy : 0.8676
## 95% CI : (0.8601, 0.8749)
## No Information Rate : 0.8874
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.5145
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8296
## Specificity : 0.8724
## Pos Pred Value : 0.4521
## Neg Pred Value : 0.9758
## Prevalence : 0.1126
## Detection Rate : 0.0934
## Detection Prevalence : 0.2066
## Balanced Accuracy : 0.8510
##
## 'Positive' Class : yes
##
On this occasion we choose to focus on the Accuracy metric, as we want an overall comparison that uses both the true positives and true negatives in the data.
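As a quick sanity check, the reported accuracy can be recomputed by hand from the decision tree confusion matrix above:

# accuracy = (TN + TP) / total, cell counts taken from the confusion matrix above
(6374 + 769) / (6374 + 158 + 932 + 769) # = 7143 / 8233, roughly 0.8676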
From the confusion matrix above, the model already achieves a good accuracy of about 86%. Even so, we may be able to make it perform better by pre-pruning the tree, i.e., adjusting the parameters below (a sketch follows the list).
mincriterion: the value of the test statistic (1 - p-value) that must be exceeded in order to implement a split. For example, when mincriterion is 0.95, the p-value must be smaller than 0.05 for a node to split. This also acts as a "regulator" for the depth of the tree: the higher the mincriterion, the harder it is to perform a split, thus generating a smaller tree.
minsplit: the minimum number of observations that must exist in a node for a split to be attempted. Defaults to 20.
minbucket: the minimum number of observations in any terminal (leaf) node. For partykit::ctree this defaults to 7.
maxdepth: the maximum depth of the tree. For partykit::ctree this defaults to Inf, i.e., no depth restriction.
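As an illustration, a pre-pruned tree could be fitted as below. This is only a sketch: the control values are hypothetical starting points, not tuned settings.

# a minimal pre-pruning sketch using partykit::ctree_control;
# the values below are illustrative assumptions, not tuned results
model_dt_pruned <- ctree(Class ~ .,
                         data = data_train_up,
                         control = ctree_control(mincriterion = 0.99,
                                                 minsplit = 100,
                                                 minbucket = 50,
                                                 maxdepth = 10))
pred_dt_pruned <- predict(model_dt_pruned, data_test)
confusionMatrix(pred_dt_pruned, data_test$Class, positive = "yes")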
In this case, to assess the goodness of our model, we compare it with one other model: Naive Bayes.
There are certain characteristics of Naive Bayes that should be considered:
assumes that all features of the dataset are equally important and independent. This allows Naive Bayes to perform faster computation (the algorithm is quite simple).
prone to bias due to data scarcity. In some cases, our data may have a distribution where scarce observations lead to probabilities approximating close to 0 or 1, which introduces a heavy bias into our model that could lead to poor performance on unseen data.
more appropriate for data with categorical predictors. This is because Naive Bayes is sensitive to data scarcity, and a continuous variable might contain very few, or even only one, observation for a certain value.
applies the Laplace estimator/smoothing for the data scarcity problem. The Laplace estimator adds a small number (usually 1) to each of the counts in the frequency table, which ensures that each class-feature combination has a non-zero probability of occurring (a worked example follows this list).
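To illustrate the arithmetic with hypothetical counts (not taken from our data), suppose a class contains 10 observations and one level of a 3-level feature never occurs in that class:

# Laplace smoothing sketch with made-up counts
count_level <- 0  # hypothetical: the level never appears in this class
count_class <- 10 # hypothetical: total observations in this class
n_levels <- 3     # hypothetical: number of levels of this feature

count_level / count_class                    # unsmoothed: 0
(count_level + 1) / (count_class + n_levels) # with Laplace = 1: 1/13, about 0.077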
Based on these characteristics, our data are actually not very appropriate for Naive Bayes: as seen in the correlation plot, some of our predictors are highly correlated with one another. Even so, we will still try Naive Bayes and compare the result with the decision tree. While building the Naive Bayes model, we also apply the Laplace estimator.
library(e1071)
model_naive <- naiveBayes(x = data_train_up %>% select(-Class),
                          y = data_train_up$Class,
                          laplace = 1)

naive_class <- predict(model_naive, data_test, type = "class")
head(naive_class)

## [1] no no no no no no
## Levels: no yes
library(caret)
confusionMatrix(naive_class, data_test$Class, positive = "yes")

## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6247 264
## yes 1059 663
##
## Accuracy : 0.8393
## 95% CI : (0.8312, 0.8472)
## No Information Rate : 0.8874
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4149
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.71521
## Specificity : 0.85505
## Pos Pred Value : 0.38502
## Neg Pred Value : 0.95945
## Prevalence : 0.11260
## Detection Rate : 0.08053
## Detection Prevalence : 0.20916
## Balanced Accuracy : 0.78513
##
## 'Positive' Class : yes
##
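For a side-by-side comparison, the headline metrics can be assembled by hand from the two confusion matrices above (the values below are copied from the printed output):

# headline metrics copied from the two confusion matrices above
data.frame(model = c("Decision Tree", "Naive Bayes"),
           accuracy = c(0.8676, 0.8393),
           sensitivity = c(0.8296, 0.7152),
           pos_pred_value = c(0.4521, 0.3850))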
Based on the metrics table above, the predictive model built with the decision tree algorithm gave the best result, achieving the highest accuracy at about 86.8% (versus 83.9% for Naive Bayes). Therefore, the best model for predicting whether a client will subscribe, based on each client's personal data, is the decision tree.