Introduction

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often more than one call to the same client was required in order to finalize whether or not they had subscribed to a term deposit. A term deposit is a cash investment to a financial institution for an agreed interest rate over a fixed time period. This project attempts to solve this problem using Bayesian Additive Regression Trees.

NOTE: This dataset is based on the “Bank Marketing” UCI dataset.

Variable Description

  • age: age of client
  • job : type of job
  • marital : marital status
  • education: highest educational achievement
  • default: has credit in default?
  • housing: has housing loan?
  • loan: has personal loan?
  • contact: contact communication type
  • month: last contact month of year
  • day_of_week: last contact day of the week
  • duration: last contact duration, in seconds
  • campaign: number of contacts performed during this campaign and for this client
  • pdays: number of days that passed by after the client was last contacted from a previous campaign (999 means client was not previously contacted)
  • previous: number of contacts performed before this campaign and for this client
  • poutcome: outcome of the previous marketing campaign
  • emp.var.rate: employment variation rate - quarterly indicator
  • cons.price.idx: consumer price index - monthly indicator
  • cons.conf.idx: consumer confidence index - monthly indicator
  • euribor3m: euribor 3 month rate - daily indicator
  • nr.employed: number of employees - quarterly indicator
  • y - has the client subscribed a term deposit?

Data Cleaning

NOTE: Although this data is already well-cleaned, there are a few things that will be done and discussed.

Explore Raw Data

Data

datatable(bank, options = list(pageLength = 10, scrollX = TRUE))

Structure

glimpse(bank)
## Observations: 4,119
## Variables: 21
## $ age            <int> 30, 39, 25, 38, 47, 32, 32, 41, 31, 35, 25, 36,...
## $ job            <fctr> blue-collar, services, services, services, adm...
## $ marital        <fctr> married, single, married, married, married, si...
## $ education      <fctr> basic.9y, high.school, high.school, basic.9y, ...
## $ default        <fctr> no, no, no, no, no, no, no, unknown, no, unkno...
## $ housing        <fctr> yes, no, yes, unknown, yes, no, yes, yes, no, ...
## $ loan           <fctr> no, no, no, unknown, no, no, no, no, no, no, n...
## $ contact        <fctr> cellular, telephone, telephone, telephone, cel...
## $ month          <fctr> may, may, jun, jun, nov, sep, sep, nov, nov, m...
## $ day_of_week    <fctr> fri, fri, wed, fri, mon, thu, mon, mon, tue, t...
## $ duration       <int> 487, 346, 227, 17, 58, 128, 290, 44, 68, 170, 3...
## $ campaign       <int> 2, 4, 1, 3, 1, 3, 4, 2, 1, 1, 1, 1, 2, 2, 2, 2,...
## $ pdays          <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 99...
## $ previous       <int> 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ poutcome       <fctr> nonexistent, nonexistent, nonexistent, nonexis...
## $ emp.var.rate   <dbl> -1.8, 1.1, 1.4, 1.4, -0.1, -1.1, -1.1, -0.1, -0...
## $ cons.price.idx <dbl> 92.893, 93.994, 94.465, 94.465, 93.200, 94.199,...
## $ cons.conf.idx  <dbl> -46.2, -36.4, -41.8, -41.8, -42.0, -37.5, -37.5...
## $ euribor3m      <dbl> 1.313, 4.855, 4.962, 4.959, 4.191, 0.884, 0.879...
## $ nr.employed    <dbl> 5099.1, 5191.0, 5228.1, 5228.1, 5195.8, 4963.6,...
## $ y              <fctr> no, no, no, no, no, no, no, no, no, no, no, no...

Summary

summary(bank)
##       age                 job           marital    
##  Min.   :18.00   admin.     :1012   divorced: 446  
##  1st Qu.:32.00   blue-collar: 884   married :2509  
##  Median :38.00   technician : 691   single  :1153  
##  Mean   :40.11   services   : 393   unknown :  11  
##  3rd Qu.:47.00   management : 324                  
##  Max.   :88.00   retired    : 166                  
##                  (Other)    : 649                  
##                education       default        housing          loan     
##  university.degree  :1264   no     :3315   no     :1839   no     :3349  
##  high.school        : 921   unknown: 803   unknown: 105   unknown: 105  
##  basic.9y           : 574   yes    :   1   yes    :2175   yes    : 665  
##  professional.course: 535                                               
##  basic.4y           : 429                                               
##  basic.6y           : 228                                               
##  (Other)            : 168                                               
##       contact         month      day_of_week    duration     
##  cellular :2652   may    :1378   fri:768     Min.   :   0.0  
##  telephone:1467   jul    : 711   mon:855     1st Qu.: 103.0  
##                   aug    : 636   thu:860     Median : 181.0  
##                   jun    : 530   tue:841     Mean   : 256.8  
##                   nov    : 446   wed:795     3rd Qu.: 317.0  
##                   apr    : 215               Max.   :3643.0  
##                   (Other): 203                               
##     campaign          pdays          previous             poutcome   
##  Min.   : 1.000   Min.   :  0.0   Min.   :0.0000   failure    : 454  
##  1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.0000   nonexistent:3523  
##  Median : 2.000   Median :999.0   Median :0.0000   success    : 142  
##  Mean   : 2.537   Mean   :960.4   Mean   :0.1903                     
##  3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.0000                     
##  Max.   :35.000   Max.   :999.0   Max.   :6.0000                     
##                                                                      
##   emp.var.rate      cons.price.idx  cons.conf.idx     euribor3m    
##  Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   Min.   :0.635  
##  1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.334  
##  Median : 1.10000   Median :93.75   Median :-41.8   Median :4.857  
##  Mean   : 0.08497   Mean   :93.58   Mean   :-40.5   Mean   :3.621  
##  3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961  
##  Max.   : 1.40000   Max.   :94.77   Max.   :-26.9   Max.   :5.045  
##                                                                    
##   nr.employed     y       
##  Min.   :4964   no :3668  
##  1st Qu.:5099   yes: 451  
##  Median :5191             
##  Mean   :5166             
##  3rd Qu.:5228             
##  Max.   :5228             
## 

The campaign,duration, previous, and age variables clearly has outliers. This can be diagnosed further but it is out of the scope of this analysis.

Histogram

bank %>% 
  plot_ly(x = bank$age, 
          type = 'histogram', 
          color = bank$y) %>%
  layout(xaxis = list(title="Age"),
         yaxis = list(title="Count"))

Tidy Data

Principles

Tidy data ensures the dataset is a collection of values. The following are the Principles of Tidy Data:

  • Each observation forms a row
  • Each variable forms a column
  • Each type of observational unit forms a table
  • Each value belongs to a variable and an observation

Messy Data Indentifiers

Knowing this, these are symptons that identify messy data:

  • Column headers are values, not variable names
    • can be solved by gathering the columns headers as values into new columns
  • Multiple values are stored in one column
    • can be solved by separating the values into new columns
  • Variables are stored in both rows and columns
    • can be solved by spreading the variables into new column names
  • Multiple types of observational units are stored in the same table, or a single observational unit is stored in multiple tables
    • can be solved by using join or unjoin techniques

Data Preparation

There are two paths in preparing data:

  1. For Exploratory Data Analysis
  2. For Predictive Modeling

Data For Analysis

Preparing data for analysis considers the following:

  • Type conversion
  • String manipulation
  • Missing and special values
  • Feature engineering
  • Outliers and obvious errors

Missing Values

There are several missing values in some categorical attributes, all coded with the “unknown” label. These missing values can be treated as a possible class label or by using deletion/imputation techniques. For this analysis, I will use them as class labels to help draw more insight and predict the outcome.

Data For Modeling

Preparing data for modeling considers the following:

  • Splitting data into training_set and test_set (also a validation set if preferred)
  • Encode the categorical variables
  • Feature Scaling

Data Split

library(caTools)
set.seed(123)

bank$duration <- NULL

split <- sample.split(bank[,'y'], SplitRatio = 0.8)
training_set <- subset(bank, split == TRUE)
y_train <- training_set[,'y']
test_set <- subset(bank, split == FALSE)
y_test <- test_set[,'y']

dm <- rbind(Train=dim(training_set),Test=dim(test_set))
colnames(dm) <- c("Observations", "Features")
kable(dm)
Observations Features
Train 3295 20
Test 824 20

Encoding and Scaling

label_encoder <- function(df){
  for (i in 1:ncol(df)) {
    factr <- as.vector(sapply(df[i], is.factor))
    if (factr) {
      df[i] <- lapply(df[i], as.numeric)
    }
  }
  return(df)
}

#training_set[,-20] <- scale(label_encoder(training_set[,-20]))
training_set[,-20] <- label_encoder(training_set[,-20])
datatable(training_set, 
          options = list(pageLength = 5, 
                         scrollX = TRUE))
#test_set[,-20] <- scale(label_encoder(test_set[,-20]))
test_set[,-20] <- label_encoder(test_set[,-20])
datatable(test_set, 
          options = list(pageLength = 5, 
                         scrollX = TRUE))

Predictive Modeling

Bayesian Additive Regression Trees

“BART” is a Bayesian approach to nonparametric function estimation using regression trees which rely on recursive binary partitioning of predictor space into a set of hyperrectangles in order to approximate some unknown function. It can be considered a sum-of-trees ensemble, with an estimation approach relying on a fully Bayesian probability model.

Please refer here for a better overview of BART and the bartMachine package.

Training

library(bartMachine)
set_bart_machine_num_cores(2)
## bartMachine now using 2 cores.
bart <- bartMachine(as.data.frame(training_set),y_train)
## bartMachine initializing with 50 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 22 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for classification ...Covariate importance prior ON. 
## evaluating in sample data...done
bart
## bartMachine v1.2.3 for classification
## 
## training data n = 3295 and p = 21 
## built in 8.1 secs on 2 cores, 50 trees, 250 burn-in and 1000 post. samples
## 
## confusion matrix:
## 
##            predicted no predicted yes model errors
## actual no          2934             0            0
## actual yes            0           361            0
## use errors            0             0            0

Testing

pred <- bart_predict_for_test_data(bart_machine = bart,
                                   Xtest = as.data.frame(test_set),
                                   ytest = y_test)
kable(pred$confusion_matrix)
predicted no predicted yes model errors
actual no 734 0 0
actual yes 0 90 0
use errors 0 0 0