1 Introduction

This report will analyze the Bank Marketing data from marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same customer was required, in order to access if the product (bank term deposit) would be (or not) subscribed.

2 Data Input & Library

#Entering data input from CSV file with ";" (semicolon) as separator and directly convert strings as factor.
bank <- read.csv("bank.csv", sep=";",stringsAsFactors = T)
library(ggplot2)
library(ggthemes)
library(rsconnect)
library(inspectdf)
library(dplyr)
library(keras)
library(caret)
library(recipes)
library(rsample)
Sys.setenv(RETICULATE_PYTHON = "d:/ProgramData/Anaconda3/envs/tf_image/python.exe")
RETICULATE_PYTHON="d:/ProgramData/Anaconda3/envs/tf_image/python.exe"
library(reticulate)
use_python("d:/ProgramData/Anaconda3/envs/tf_image/python.exe")
use_condaenv("tf_image",  required = T)

2.1 Data Inspection

#Here's a quick look at the initial data set we're using.
head(bank)
##   age         job marital education default balance housing loan  contact day
## 1  30  unemployed married   primary      no    1787      no   no cellular  19
## 2  33    services married secondary      no    4789     yes  yes cellular  11
## 3  35  management  single  tertiary      no    1350     yes   no cellular  16
## 4  30  management married  tertiary      no    1476     yes  yes  unknown   3
## 5  59 blue-collar married secondary      no       0     yes   no  unknown   5
## 6  35  management  single  tertiary      no     747      no   no cellular  23
##   month duration campaign pdays previous poutcome  y
## 1   oct       79        1    -1        0  unknown no
## 2   may      220        1   339        4  failure no
## 3   apr      185        1   330        1  failure no
## 4   jun      199        4    -1        0  unknown no
## 5   may      226        1    -1        0  unknown no
## 6   feb      141        2   176        3  failure no
tail(bank)
##      age           job marital education default balance housing loan  contact
## 4516  32      services  single secondary      no     473     yes   no cellular
## 4517  33      services married secondary      no    -333     yes   no cellular
## 4518  57 self-employed married  tertiary     yes   -3313     yes  yes  unknown
## 4519  57    technician married secondary      no     295      no   no cellular
## 4520  28   blue-collar married secondary      no    1137      no   no cellular
## 4521  44  entrepreneur  single  tertiary      no    1136     yes  yes cellular
##      day month duration campaign pdays previous poutcome  y
## 4516   7   jul      624        5    -1        0  unknown no
## 4517  30   jul      329        5    -1        0  unknown no
## 4518   9   may      153        1    -1        0  unknown no
## 4519  19   aug      151       11    -1        0  unknown no
## 4520   6   feb      129        4   211        3    other no
## 4521   3   apr      345        2   249        7    other no
#Below are the dimension of the dataset.
dim(bank)
## [1] 4521   17

Bank Marketing data contains 4,521 rows and 17 columns.

Below are the columns name in this dataset.

#Below are the columns in the dataset.
names(bank)
##  [1] "age"       "job"       "marital"   "education" "default"   "balance"  
##  [7] "housing"   "loan"      "contact"   "day"       "month"     "duration" 
## [13] "campaign"  "pdays"     "previous"  "poutcome"  "y"

2.2 Data Cleansing & Coertions

First, we need to check any missing value.

colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0
anyNA(bank)
## [1] FALSE

Luckily, there is no missing value in the dataset.

After that we need to check the data type in each column.

#Inspecting data type for each column.
str(bank)
## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

3. Data Analysis

For introduction, we will use inspectdf to see mapping category of bank data and numerik distribution.

bank %>% 
  inspect_cat() %>% 
  show_plot()

bank %>% 
  inspect_num() %>% 
  show_plot()

We want to see percentage of customers response towards this campaign.

succ_tab <- data.frame(prop.table((table(bank$y)))*100)
colnames(succ_tab) <- c("response","perc(%)")
succ_tab
##   response perc(%)
## 1       no  88.476
## 2      yes  11.524

From the table above, we can see that 11,5% responded positively towards this campaign. Due to imbalance class in marketing outcome, in order to be able optimal Neural Network, we need to make the class balance.

We need to drop following variables for Neural Network:

  • month
  • day
  • pdays
  • duration
  • contact
bank <- bank %>%
  select(-c(month,day,pdays,duration,contact))

3.1 Age

From the chart below, customer’s age are dominantly between. Below are the distribution of the customer age.

ggplot(bank, aes(x=age))+
  geom_bar()+
  labs(x="Age",y="Count of Customers", title="Age of Customers")+
  theme_classic()+
  scale_x_continuous(breaks = seq(0, 90, 5))

3.2 Marital

We want to see the proportion of marital status and response of the customer.

mar_tab = data.frame(table(bank$marital, bank$y))
colnames(mar_tab) <-
  c("marital_status","response","number_of_customers")

ggplot(data=mar_tab, aes(x=marital_status,y=number_of_customers, fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Campaign Marital Status Distribution")

From the chart above, the marketing target are mostly married customer. Also, we can see that married customers are likely to respond positively compared to divorced and single.

3.3 Education

We want to know education level of the customers and their response toward this campaign.

ed_tab <- data.frame(table(bank$education,bank$y))
colnames(ed_tab) <- c("education","response","number_of_customers")
ggplot(data=ed_tab, aes(x=education,y=number_of_customers, fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Campaign Result Education Distribution")

We can see that most customers are in the secondary level. Looking by the response from the chart above, the higher the education, more likely the customer to give positive response toward the campaign.

3.4 Default

def_tab <- data.frame(table(bank$default))
colnames(def_tab) <- c("default","number_of_customers")
ggplot(data=def_tab, aes(x=default,y=number_of_customers, fill=default))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Default Customer Distribution")

Looking at the chart above that most of the customers in the marketing campaign have no default histories, which is logical that customers who default are less likely have idle fund.

Now we want to see the response distribution among the default status.

def_res_tab <- data.frame(table(bank$default,bank$y))
colnames(def_res_tab) <- c("default","response","number_of_customers")
ggplot(data=def_res_tab, aes(x=default,y=number_of_customers, fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Default Status Distribution")

From the chart above we can see that the most customers who responded positively have no default in their credit history.

3.5 Job

We want to map the customers’ job in this campaign.

job_tab <- data.frame(table(bank$job))
colnames(job_tab) <- c("job", "count")
ggplot(data=job_tab, aes(x=count, y=reorder(job,count), fill=job))+
  geom_bar(stat = 'identity')+
  labs(X=NULL,
       y=NULL,
       title="Customers' Job")+
  theme_light()

We can see from the chart above, most of the customers’ occupation are management and blue collar. Now we wan to see the response of each occupation.

job_y_tab <- data.frame(table(bank$job, bank$y))
colnames(job_y_tab) <- c("job","response","count")
ggplot(data=job_y_tab, aes(x=count,y=reorder(job,count), fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Campaign Result Job Distribution")

From the chart above, we can see that customers in management job are more likely to responded positively with the campaign.

We want to dig deeper with this job distribution. Usually, occupation is closely related to financial condition or their saving account. We would like to know average balance of each job category.

job_avg_bal <- aggregate(formula=balance~job, data=bank, FUN=mean)
job_avg_bal <- job_avg_bal[order(job_avg_bal$balance, decreasing = T),]
job_avg_bal
##              job  balance
## 6        retired 2319.191
## 4      housemaid 2083.804
## 5     management 1766.929
## 3   entrepreneur 1645.125
## 9        student 1543.821
## 12       unknown 1501.711
## 7  self-employed 1392.410
## 10    technician 1330.996
## 1         admin. 1226.736
## 8       services 1103.957
## 11    unemployed 1089.422
## 2    blue-collar 1085.162

Surprisingly, the highest average balance is in retired and housemaid category. But since the average is affected by outlier, we want to see how dispersed the balance in each job category.

ggplot(bank, aes(x=balance,y=job))+
  geom_boxplot(fill= "violet", color="black")+
   labs(y=NULL,title="Customer Job & Balance Distribution")+
  theme_light()

As we can see from the chart above, there are a lot of outliers in the balance data grouped by job, especially in the management job. Since the there are a lot of outliers, we decided to see the median, as shown in below table.

job_med_bal <- aggregate(formula=balance~job, data=bank, FUN=median)
job_med_bal <- job_med_bal[order(job_med_bal$balance, decreasing = T),]
job_med_bal
##              job balance
## 6        retired   672.5
## 12       unknown   655.5
## 5     management   577.0
## 7  self-employed   483.0
## 11    unemployed   473.5
## 10    technician   434.5
## 1         admin.   430.0
## 9        student   422.5
## 2    blue-collar   408.5
## 3   entrepreneur   365.5
## 4      housemaid   296.5
## 8       services   288.0
job_med_y_bal <- aggregate(formula=balance~job+y, data=bank, FUN=median)
job_med_y_bal <- job_med_y_bal[order(job_med_y_bal$balance, decreasing = T),]
job_med_y_bal
##              job   y balance
## 16     housemaid yes  1107.0
## 17    management yes  1064.0
## 18       retired yes  1029.5
## 23    unemployed yes   978.0
## 19 self-employed yes   780.5
## 24       unknown yes   732.0
## 14   blue-collar yes   664.0
## 12       unknown  no   634.0
## 13        admin. yes   579.0
## 22    technician yes   568.0
## 5     management  no   541.5
## 6        retired  no   511.0
## 7  self-employed  no   479.0
## 20      services yes   460.5
## 9        student  no   459.0
## 11    unemployed  no   415.0
## 10    technician  no   411.0
## 1         admin.  no   403.0
## 2    blue-collar  no   395.0
## 3   entrepreneur  no   376.0
## 21       student yes   331.0
## 8       services  no   266.0
## 4      housemaid  no   263.5
## 15  entrepreneur yes   170.0
ggplot(data=job_med_y_bal, aes(x=balance,y=reorder(job,balance), fill=y))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(y=NULL,title="Campaign Job & Balances Distribution")

Based on the chart above, there is pattern that each customers’ job category who responded yes has higher median balance than the customer’s who responded no. Which is, make sense that customer with higher median balance tend to respond positively towards the campaign compared to other.

3.6 Loan & Housing

We want to analyze the customer who has loan and housing loan.

ggplot(data=bank, aes(x=loan, fill=loan))+
  geom_bar(position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Customer Loan")+theme_light()

From the customer loan chart, we can see that most of the customer has no loan.

ggplot(data=bank, aes(x=housing, fill=housing))+
  geom_bar(position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Customer Housing Loan")+theme_light()

More than half of the customer has housing loan.

Next we want to know their response towards this campaign by looking at their liability condition.

loan_tab <- data.frame(table(bank$loan,bank$y))
colnames(loan_tab) <- c("loan", "response","count")
ggplot(data=loan_tab, aes(x=loan,y=count,fill=response))+
  geom_bar(stat = 'identity',position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Loan Status & Customer Response")+theme_light()

Looking at the chart above, customer with no loan tend to give positive response towards this campaign compared to customer who has loan. Which plausible that customer with loan tend to use the money to repay their loan.

housing_tab <- data.frame(table(bank$housing,bank$y))
colnames(housing_tab) <- c("housing", "response","count")
ggplot(data=housing_tab, aes(x=housing,y=count, fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(title="Housing Loan & Customer Response")

From the chart above, customer with no housing loan tend to give positive response towards this campaign compared to customer who has no housing loan.

From the conditions above, we can conclude that, customers with no liability, they have more positive response than the customers who are in debt.

3.7 Previous Campaign.

ggplot(data=bank, aes(x=poutcome, fill=poutcome))+
  geom_bar(position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Contact Methods")+theme_light()

From the chart above, that we can see that most of the customers haven’t been contacted prior this campaign.

We want to check the result based on this previous marketing campaign.

p_out_tab <- data.frame(table(bank$poutcome,bank$y))
colnames(p_out_tab) <- c("prev_out", "response","count")
ggplot(p_out_tab, aes(x=prev_out,y=count, fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(title="Contact Response")

Based on the chart above, we can see that:

  1. For previous campaign failure, customers are less likely to give positive responses on the campaign
  2. For previous campaign success, customers are more likely to give positive responses on the campaign
  3. For no previous campaign, the result is not substantial but most of the customers who gave positive result were not contacted prior this campaign.

3.8 Balances.

We want to look balances of the customer. First we want to map how disperse the balance data.

ggplot(bank, aes(x=balance))+
  geom_boxplot(fill= "violet", color="black")+
   labs(y=NULL,title="Customer Balance Distribution")+
  theme_light()

As we can see, it is quite hard to see at which point the median, so we try to see by using summary of balance.

summary(bank$balance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3313      69     444    1423    1480   71188

From the table above, the median of the balance is 444 and also with a lot of outliers, either minus balances or more than 1,500.

Previously, we already compared the balance with customers’ job, we want to see whether balance is related to education.

We will make filtered tables by using each education category, by subsetting each category as shown below.

prim_ed <- bank[(bank$education=="primary"),]
sec_ed <- bank[(bank$education=="secondary"),]
tert_ed <- bank[(bank$education=="tertiary"),]
unkn_ed <- bank[(bank$education=="unknown"),]
ggplot(prim_ed, aes(x=balance))+
  geom_boxplot(fill= "orange", color="red")+
   labs(y=NULL,title="Balance Distribution in Primary Education")+
  theme_light()

summary(prim_ed$balance)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  -966.00    65.25   404.50  1411.54  1394.75 71188.00
ggplot(sec_ed, aes(x=balance))+
  geom_boxplot(fill= "orange", color="red")+
   labs(y=NULL,title="Balance Distribution in Secondary Education")+
  theme_light()

summary(sec_ed$balance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2082.0    57.0   385.5  1196.8  1269.8 26452.0
ggplot(tert_ed, aes(x=balance))+
  geom_boxplot(fill= "orange", color="red")+
   labs(y=NULL,title="Balance Distribution in Tertiary Education")+
  theme_light()

summary(tert_ed$balance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3313.0   102.0   583.5  1775.4  1840.0 42045.0
ggplot(unkn_ed, aes(x=balance))+
  geom_boxplot(fill= "orange", color="red")+
   labs(y=NULL,title="Balance Distribution in Secondary Education")+
  theme_light()

summary(unkn_ed$balance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -715.0    88.5   574.0  1701.2  2066.0 21244.0

From each category summary and chart above, the higher the education, the higher the average and max balance in their account hence it could be argued that these are more likely to buy term deposits.

4. Neural Network

4.1 Pre-Processing

After cleaning, below are data set what we are going to use for neural network, we can see some variables are factor and numeric. But for neural network, we need to convert all the categorical to numeric.

str(bank)
## 'data.frame':    4521 obs. of  12 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

We will split training and test data using 80% training and 20% test.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)

intrain <- initial_split(data=bank, prop=0.8, strata="y")
bank_train <- training(intrain)
bank_test <- testing(intrain)

head(bank_train)
##   age           job marital education default balance housing loan campaign
## 2  33      services married secondary      no    4789     yes  yes        1
## 3  35    management  single  tertiary      no    1350     yes   no        1
## 4  30    management married  tertiary      no    1476     yes  yes        4
## 7  36 self-employed married  tertiary      no     307     yes   no        1
## 8  39    technician married secondary      no     147     yes   no        2
## 9  41  entrepreneur married  tertiary      no     221     yes   no        2
##   previous poutcome  y
## 2        4  failure no
## 3        1  failure no
## 4        0  unknown no
## 7        2    other no
## 8        0  unknown no
## 9        0  unknown no

For Keras machine learning, we need to convert raw data to machine learning dataset using recipes library.

Code below are the “recipes” for baking process to convert raw data to machine learning dataset.

rec_obj <- recipe(y~.,data=bank_train) %>% 
  step_discretize(age, options=list(cuts=6)) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_center(all_predictors(), -all_outcomes()) %>% 
  step_scale(all_predictors(), -all_outcomes()) %>% 
  prep(data=bank_train)
## Warning: Data not binned; too few unique values per bin. Adjust 'min_unique' as
## needed

We are baking the raw data using recipes (rec_obj) we made previously.

bank_tr_baked <- bake(rec_obj, new_data = bank_train) %>% select(-y)
bank_ts_baked <- bake(rec_obj, new_data = bank_test) %>% select(-y)

Glimpse of machine learning data.

head(bank_tr_baked)
## # A tibble: 6 x 26
##        age  balance campaign previous job_blue.collar job_entrepreneur
##      <dbl>    <dbl>    <dbl>    <dbl>           <dbl>            <dbl>
## 1 -0.770    1.16      -0.591    2.09           -0.518           -0.191
## 2 -0.579   -0.00960   -0.591    0.283          -0.518           -0.191
## 3 -1.06     0.0334     0.398   -0.320          -0.518           -0.191
## 4 -0.484   -0.365     -0.591    0.886          -0.518           -0.191
## 5 -0.198   -0.420     -0.261   -0.320          -0.518           -0.191
## 6 -0.00686 -0.395     -0.261   -0.320          -0.518            5.24 
## # ... with 20 more variables: job_housemaid <dbl>, job_management <dbl>,
## #   job_retired <dbl>, job_self.employed <dbl>, job_services <dbl>,
## #   job_student <dbl>, job_technician <dbl>, job_unemployed <dbl>,
## #   job_unknown <dbl>, marital_married <dbl>, marital_single <dbl>,
## #   education_secondary <dbl>, education_tertiary <dbl>,
## #   education_unknown <dbl>, default_yes <dbl>, housing_yes <dbl>,
## #   loan_yes <dbl>, poutcome_other <dbl>, poutcome_success <dbl>, ...

Convert y outcome column to 1 or 0.

y_bank_tr <- ifelse(pull(bank_train,y)=="yes", 1,0)
y_bank_ts <- ifelse(pull(bank_test,y)=="yes", 1,0)

Due to imbalance class in y and data count that is sufficient, we are going to balance the class using downSample.

bank_tr_y_baked <- bank_tr_baked %>% 
  mutate(y=as.factor(y_bank_tr))
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)

bank_tr_y_bk_down <- downSample(x = bank_tr_y_baked %>% select(-y),
                         y = bank_tr_y_baked$y,
                         yname = "y")

We can see that the proportion is already balance.

prop.table(table(bank_tr_y_bk_down$y))
## 
##   0   1 
## 0.5 0.5

We will separate between predictors and outcome for ML training.

bank_tr_bk_down <- bank_tr_y_bk_down %>% select(-y)
bank_tr_y_down <- bank_tr_y_bk_down %>% select(y)

Convert predictors to matrix so python can read the data.

bank_tr_bk_down_mat <- as.matrix(bank_tr_bk_down)

dim(bank_tr_bk_down_mat)
## [1] 832  26

Convert outcome for ML training and change to matrix.

bank_tr_y_down_keras <- bank_tr_y_down %>% 
  as.matrix() %>% 
  to_categorical()
## Loaded Tensorflow version 2.3.0
dim(bank_tr_y_down_keras)
## [1] 832   2

4.2 Model build

In this part we are going to build two models, simple and complex model.

4.2.1 Simple Model

For simple model, we will only use two hidden layers with following configuration:

  • Units Input : 26
  • Hidden Layer 1 : 26
  • Hidden Layer 2 : 32
  • Units Output : 2
  • Activation : sigmoid
  • Loss Function : Binary crossentropy
  • Optimize : optimizer_adam
  • Learning Rate : 0.001
  • Epoch : 2000
  • Batch : 80
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)

model_bank <- keras_model_sequential()
model_bank %>% 
  layer_dense(input_shape = ncol(bank_tr_bk_down_mat), 
              units = 26, 
              activation = "sigmoid",
              name = "hidden1",
              kernel_initializer = "uniform") %>%
  
  layer_dropout(rate = 0.001) %>%
  
  layer_dense(units = 32, 
              activation = "sigmoid", 
              name = "hidden2",
              kernel_initializer = "uniform") %>%
  
  layer_dropout(rate = 0.001) %>%
  
  layer_dense(units = 2, 
              activation = "sigmoid",
              name = "output",
              kernel_initializer = "uniform")

summary(model_bank)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## hidden1 (Dense)                     (None, 26)                      702         
## ________________________________________________________________________________
## dropout_1 (Dropout)                 (None, 26)                      0           
## ________________________________________________________________________________
## hidden2 (Dense)                     (None, 32)                      864         
## ________________________________________________________________________________
## dropout (Dropout)                   (None, 32)                      0           
## ________________________________________________________________________________
## output (Dense)                      (None, 2)                       66          
## ================================================================================
## Total params: 1,632
## Trainable params: 1,632
## Non-trainable params: 0
## ________________________________________________________________________________
model_bank %>% 
  compile(loss = "binary_crossentropy",
          optimizer = optimizer_adam(learning_rate = 0.001), 
          metrics = "accuracy")
history <- model_bank %>% 
           fit(bank_tr_bk_down_mat,
               bank_tr_y_down_keras,
               epoch = 1750,
               batch= 80)
plot(history)
## `geom_smooth()` using formula 'y ~ x'

4.2.2 Complex Model

For complex model, we will use five hidden layers with following configuration:

  • Units Input : 26
  • Hidden Layer 1 : 26
  • Hidden Layer 2 : 32
  • Hidden Layer 2 : 128
  • Hidden Layer 2 : 64
  • Hidden Layer 2 : 16
  • Units Output : 2
  • Activation : sigmoid
  • Loss Function : Binary crossentropy
  • Optimize : optimizer_adam
  • Learning Rate : 0.001
  • Epoch : 250
  • Batch : 80
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)

model_bank_comp <- keras_model_sequential()
model_bank_comp %>% 
  layer_dense(input_shape = ncol(bank_tr_bk_down_mat), 
              units = 26, 
              activation = "sigmoid",
              name = "hidden1",
              kernel_initializer = "uniform") %>%
  
  layer_dropout(rate = 0.001) %>%
  
  layer_dense(units = 32, 
              activation = "sigmoid", 
              name = "hidden2",
              kernel_initializer = "uniform") %>%
  
  layer_dropout(rate = 0.001) %>%
  
    layer_dense(units = 128, 
              activation = "sigmoid", 
              name = "hidden3",
              kernel_initializer = "uniform") %>%
  
  layer_dropout(rate = 0.001) %>%
  
  layer_dense(units = 64, 
              activation = "sigmoid", 
              name = "hidden4",
              kernel_initializer = "uniform") %>%
  
  layer_dropout(rate = 0.001) %>%
  
    layer_dense(units = 16, 
              activation = "sigmoid", 
              name = "hidden5",
              kernel_initializer = "uniform") %>%
  
  layer_dropout(rate = 0.001) %>%
  
  layer_dense(units = 2, 
              activation = "sigmoid",
              name = "output",
              kernel_initializer = "uniform")

summary(model_bank_comp)
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## hidden1 (Dense)                     (None, 26)                      702         
## ________________________________________________________________________________
## dropout_6 (Dropout)                 (None, 26)                      0           
## ________________________________________________________________________________
## hidden2 (Dense)                     (None, 32)                      864         
## ________________________________________________________________________________
## dropout_5 (Dropout)                 (None, 32)                      0           
## ________________________________________________________________________________
## hidden3 (Dense)                     (None, 128)                     4224        
## ________________________________________________________________________________
## dropout_4 (Dropout)                 (None, 128)                     0           
## ________________________________________________________________________________
## hidden4 (Dense)                     (None, 64)                      8256        
## ________________________________________________________________________________
## dropout_3 (Dropout)                 (None, 64)                      0           
## ________________________________________________________________________________
## hidden5 (Dense)                     (None, 16)                      1040        
## ________________________________________________________________________________
## dropout_2 (Dropout)                 (None, 16)                      0           
## ________________________________________________________________________________
## output (Dense)                      (None, 2)                       34          
## ================================================================================
## Total params: 15,120
## Trainable params: 15,120
## Non-trainable params: 0
## ________________________________________________________________________________
model_bank_comp %>% 
  compile(loss = "binary_crossentropy",
          optimizer = optimizer_adam(learning_rate = 0.001), 
          metrics = "accuracy")
history_comp <- model_bank_comp %>% 
           fit(bank_tr_bk_down_mat,
               bank_tr_y_down_keras,
               epoch = 250,
               batch= 80)
plot(history_comp)
## `geom_smooth()` using formula 'y ~ x'

4.3 Predict

In this section we will predict using our test data using simple model and complex model.

4.3.1 Simple Model

bank_ts_bk_mat <- as.matrix(bank_ts_baked)
pred_bank_simp <- predict_classes(model_bank,bank_ts_bk_mat) 
head(pred_bank_simp)
## [1] 1 0 0 1 0 1

We evaluate simple model using confusionMatrix.

confusionMatrix(data=as.factor(pred_bank_simp),
                reference=as.factor(y_bank_ts),positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 458  39
##          1 342  66
##                                           
##                Accuracy : 0.579           
##                  95% CI : (0.5461, 0.6114)
##     No Information Rate : 0.884           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0892          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.62857         
##             Specificity : 0.57250         
##          Pos Pred Value : 0.16176         
##          Neg Pred Value : 0.92153         
##              Prevalence : 0.11602         
##          Detection Rate : 0.07293         
##    Detection Prevalence : 0.45083         
##       Balanced Accuracy : 0.60054         
##                                           
##        'Positive' Class : 1               
## 

As we can see that the simple model is overfit because for accuracy in test data is much lower than model accuracy. Also very low precision, that explains the model cannot predict positive class accurately, as we can see that the False Positive is quite high in this model. So this model need to be refined.

4.3.1 Complex Model

pred_bank_comp <- predict_classes(model_bank_comp,bank_ts_bk_mat) 
head(pred_bank_comp)
## [1] 0 0 1 0 1 1
confusionMatrix(data=as.factor(pred_bank_comp),
                reference=as.factor(y_bank_ts), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 515  43
##          1 285  62
##                                           
##                Accuracy : 0.6376          
##                  95% CI : (0.6053, 0.6689)
##     No Information Rate : 0.884           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.117           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.59048         
##             Specificity : 0.64375         
##          Pos Pred Value : 0.17867         
##          Neg Pred Value : 0.92294         
##              Prevalence : 0.11602         
##          Detection Rate : 0.06851         
##    Detection Prevalence : 0.38343         
##       Balanced Accuracy : 0.61711         
##                                           
##        'Positive' Class : 1               
## 

As we can see that the complex model is also overfit, even though we added more layer. Also, for accuracy in test data is much lower than model accuracy. Also very low precision, that explains the model cannot predict positive class accurately, as we can see that the False Positive is quite high in this model. Even though we already added layer, it seems like the accuracy was not improving.

4.3.3 Predict using test data.

We will evaluate simple and complex model using test data to Kaggle submission.

test_uci <- read.csv("test.csv",stringsAsFactors = T)
test_uci_clean <- test_uci %>% select(-c(month,day,pdays,duration,contact))
test_uci_bake <- bake(rec_obj, new_data = test_uci_clean)
test_uci_bake_mat <- as.matrix(test_uci_bake)

Predict using simple model.

pred_uci_simp <- predict_classes(model_bank,test_uci_bake_mat)
head(pred_uci_simp)
## [1] 0 0 1 1 0 0
pred_uci_comp <- predict_classes(model_bank_comp,test_uci_bake_mat)
head(pred_uci_comp)
## [1] 0 0 1 1 0 0
submission <- read.csv("samplesubmission.csv")
test_simp <- submission %>% 
  mutate(y=pred_uci_simp)
test_comp <- submission %>% 
  mutate(y=pred_uci_comp)
write.csv(pred_uci_simp, "test_simp.csv")
write.csv(pred_uci_comp, "test_comp.csv")

Below are submission result to Kaggle for simple and complex model. Both models are still overfit with very low accuracy for both models.

5. Conclusion

This marketing campaign resulted 11.5% of success campaign. From the chart above we would like to point key finding that might affect the result the marketing campaign:

  1. Higher education would have affected the perception of the investment offered in the campaign.
  2. Higher median balance would affect the decision whether to allocate their balance for investment.
  3. Customer who has loans and housing loans are less likely to invest their money.
  4. The campaign should concentrate on new customer as seen on chart previous campaign chart, that most customer who responded positively are customer who never been called prior of the campaign.

For Machine learning section:

  1. Models need to be improved so they can predict positive class correctly (Precision) in test data also over all accuracy, like feature addition or using other ML method.
  2. Class imbalance added challenges in developing accurate model even though down sampling already done.

Reference

  • [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM’2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.
  • “Keras Customer Churn”
  • “Bank Marketing UCI”