Introduction

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed. The dataset consists of 45211 observations and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]

Business Problem

There has been a revenue decline for the Portuguese bank and they would like to know what actions to take. After investigation, we found out that the root cause is that their clients are not depositing as frequently as before. Knowing that term deposits allow banks to hold onto a deposit for a specific amount of time, so banks can invest in higher gain financial products to make a profit. In addition, banks also hold better chance to persuade term deposit clients into buying other products such as funds or insurance to further increase their revenues. As a result, the Portuguese bank would like to identify existing clients that have higher chance to subscribe for a term deposit and focus marketing effort on such clients.

Analytics Objective

A classification approach to predict which clients are more likely to subscribe for term deposits.

Data Import

library(ggplot2)
library(caret)
library(caretEnsemble)
library(ROSE)
library(mlbench)
library(DMwR)
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(tidyverse)
library(stringr)

#Load the dataset
#url<-"https://raw.githubusercontent.com/nitishghosal/Measuring-Campaign-Effectiveness/master/bank-full.csv"

#mydata <- as_data_frame(read.csv(url,stringsAsFactors = FALSE,na="NA"))

mydata <- read.csv("bank-full.csv")

#Summary on dataset
summary(mydata)
##       age                 job           marital          education    
##  Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
##  1st Qu.:33.00   management :9458   married :27214   secondary:23202  
##  Median :39.00   technician :7597   single  :12790   tertiary :13301  
##  Mean   :40.94   admin.     :5171                    unknown  : 1857  
##  3rd Qu.:48.00   services   :4154                                     
##  Max.   :95.00   retired    :2264                                     
##                  (Other)    :6835                                     
##  default        balance       housing      loan            contact     
##  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
##  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
##              Median :   448                           unknown  :13020  
##              Mean   :  1362                                            
##              3rd Qu.:  1428                                            
##              Max.   :102127                                            
##                                                                        
##       day            month          duration         campaign     
##  Min.   : 1.00   may    :13766   Min.   :   0.0   Min.   : 1.000  
##  1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
##  Median :16.00   aug    : 6247   Median : 180.0   Median : 2.000  
##  Mean   :15.81   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
##  3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
##  Max.   :31.00   apr    : 2932   Max.   :4918.0   Max.   :63.000  
##                  (Other): 6060                                    
##      pdays          previous           poutcome       y        
##  Min.   : -1.0   Min.   :  0.0000   failure: 4901   no :39922  
##  1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   yes: 5289  
##  Median : -1.0   Median :  0.0000   success: 1511              
##  Mean   : 40.2   Mean   :  0.5803   unknown:36959              
##  3rd Qu.: -1.0   3rd Qu.:  0.0000                              
##  Max.   :871.0   Max.   :275.0000                              
## 
str(mydata)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Data Visualization

Next, we will visualize the relationship between each variable versus y (whether or not a client has subscribed to term deposit) through box chart. We found out that only duration has significance among all other variables.

## Understanding relation among important variables through visualizations

p_age <- ggplot(mydata, aes(factor(y), age)) + geom_boxplot(aes(fill = factor(y)))
p_age

p_balance <- ggplot(mydata, aes(factor(y), balance)) + geom_boxplot(aes(fill = factor(y)))
p_balance

p_day <- ggplot(mydata, aes(factor(y), day)) + geom_boxplot(aes(fill = factor(y)))
p_day

p_duration <- ggplot(mydata, aes(factor(y), duration)) + geom_boxplot(aes(fill = factor(y)))
p_duration

p_campaign <- ggplot(mydata, aes(factor(y), campaign)) + geom_boxplot(aes(fill = factor(y)))
p_campaign

p_pdays <- ggplot(mydata, aes(factor(y), pdays)) + geom_boxplot(aes(fill = factor(y)))
p_pdays

p_previous <- ggplot(mydata, aes(factor(y), previous)) + geom_boxplot(aes(fill = factor(y)))
p_previous

Data Preparation

We will create dummy variables for our categorical variables.

#Generate dummy variables
for(level in unique(mydata$job)){
  mydata[paste("job", level, sep = "_")] <- ifelse(mydata$job == level, 1, 0)
}

for(level in unique(mydata$marital)){
  mydata[paste("marital", level, sep = "_")] <- ifelse(mydata$marital == level, 1, 0)
}

for(level in unique(mydata$education)){
  mydata[paste("education", level, sep = "_")] <- ifelse(mydata$education == level, 1, 0)
}

mydata$default_yes <- ifelse(mydata$default == "yes", 1, 0)

mydata$housing_yes <- ifelse(mydata$housing == "yes", 1, 0)

mydata$loan_yes <- ifelse(mydata$loan == "yes", 1, 0)

for(level in unique(mydata$contact)){
  mydata[paste("contact", level, sep = "_")] <- ifelse(mydata$contact == level, 1, 0)
}

for(level in unique(mydata$month)){
  mydata[paste("month", level, sep = "_")] <- ifelse(mydata$month == level, 1, 0)
}

for(level in unique(mydata$poutcome)){
  mydata[paste("poutcome", level, sep = "_")] <- ifelse(mydata$poutcome == level, 1, 0)
}

mydata$Class <- ifelse(mydata$y == "yes", "Yes", "No")

#Remove unwanted columns
mydata$X <- NULL
mydata$job <- NULL
mydata$marital <- NULL
mydata$education <- NULL
mydata$default <- NULL
mydata$housing <- NULL
mydata$loan <- NULL
mydata$contact <- NULL
mydata$month <- NULL
mydata$poutcome <- NULL
mydata$y <- NULL

mydata$Class <- as.factor((mydata$Class))
colnames(mydata)[11] <- "job_blue_collar"
colnames(mydata)[14] <- "job_admin"
colnames(mydata)[16] <- "job_self_employeed"

We split the data into training and testing samples and perform scaling.Since the variables are of different magnitude, scaling is recommended, though tree-based models do not usually require scaling to achieve good performance.

#Splitting
set.seed(1)
training_size <- floor(0.80 * nrow(mydata))
train_ind <- sample(seq_len(nrow(mydata)), size = training_size)
training <- mydata[train_ind, ]
testing <- mydata[-train_ind, ]

#Normalizing
preProcValues <- preProcess(training, method = c("center", "scale"))
scaled.training <- predict(preProcValues, training)
scaled.testing <- predict(preProcValues, testing)

Handling Imbalanced Data

We can observe that y is highly scewed. That is, we have a lot of non-subscribers; subscribers only account for about 11.7% of all clients. As a result, sampling is required to handle the imbalanced classifier. Here we will demonstrate 4 popular resampling techniques, including under-sampling, over-sampling, smote sampling, and rose sampling on the training data set.

#Sampling
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

set.seed(2)
down_training <- downSample(x = scaled.training[, -ncol(scaled.training)],
                            y = scaled.training$Class)

up_training <- upSample(x = scaled.training[, -ncol(scaled.training)],
                        y = scaled.training$Class)

smote_training <- SMOTE(Class~., data = scaled.training)

rose_training <- ROSE(Class~., data = scaled.training, seed=2)$data

Model Development

We will train the model with CART (classification and regression tree) as we would like more interpretation on the model instead of predictive power. Here we will use cross validation to train the model, and instead of accuracy, we will also use ROC as the evaluation metric (the closer to 1 the better).

#Model training - CART
set.seed(3)
orig_fit <- train(Class~., data = training, 
                  method = "rpart",
                  metric = "ROC",
                  trControl = ctrl)

set.seed(4)
down_outside <- train(Class~., data = down_training, 
                      method = "rpart",
                      metric = "ROC",
                      trControl = ctrl)

set.seed(5)
up_outside <- train(Class~., data = up_training, 
                    method = "rpart",
                    metric = "ROC",
                    trControl = ctrl)

set.seed(6)
smote_outside <- train(Class~., data = smote_training, 
                       method = "rpart",
                       metric = "ROC",
                       trControl = ctrl)

set.seed(7)
rose_outside <- train(Class~., data = rose_training, 
                      method = "rpart",
                      metric = "ROC",
                      trControl = ctrl)

Model Evaluation

#Model testing - Original
original_model <- list(original = orig_fit)

test_roc <- function(model, data) {
  library(pROC)
  roc_obj <- roc(data$Class, 
                 predict(model, data, type = "prob")[, "Yes"],
                 levels = c("No", "Yes"))
  ci(roc_obj)
}

original_test <- lapply(original_model, test_roc, data = testing)
original_test <- lapply(original_test, as.vector)
original_test <- do.call("rbind", original_test)
colnames(original_test) <- c("lower", "ROC", "upper")
original_test <- as.data.frame(original_test)

#Model testing - Resampled
scaled_models <- list(down = down_outside,
                      up = up_outside,
                      SMOTE = smote_outside,
                      ROSE = rose_outside)

scaled_test <- lapply(scaled_models, test_roc, data = scaled.testing)
scaled_test <- lapply(scaled_test, as.vector)
scaled_test <- do.call("rbind", scaled_test)
colnames(scaled_test) <- c("lower", "ROC", "upper")
scaled_test <- as.data.frame(scaled_test)

cart_test <- rbind(original_test,scaled_test)

After training the models on resampled data sets, we will examine the performance, in terms of ROC. It is observed that over-sampling (up) has the best performance, of mean ROC at 0.8078530.

cart_test
##              lower       ROC     upper
## original 0.7452805 0.7608143 0.7763481
## down     0.7932395 0.8065504 0.8198612
## up       0.7932395 0.8065504 0.8198612
## SMOTE    0.7508949 0.7639485 0.7770021
## ROSE     0.5791397 0.5912398 0.6033399

We will visualize and interpret the CART model. It is observed that indeed duration has the most impact on subscriptions.

fancyRpartPlot(up_outside$finalModel)

After converting the scaled duration back to original, we know the threshold call time is 473 seconds, or approximately 8 minutes. In other words, if a call lasts longer than 8 minutes, the chance of subscribing is at 84%. In addition, if a call is between 205 seconds and 473 seconds, and the contact method is unknown, then the chance of subscribing is 64%.