The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed. The dataset consists of 45211 observations and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
There has been a revenue decline for the Portuguese bank and they would like to know what actions to take. After investigation, we found out that the root cause is that their clients are not depositing as frequently as before. Knowing that term deposits allow banks to hold onto a deposit for a specific amount of time, so banks can invest in higher gain financial products to make a profit. In addition, banks also hold better chance to persuade term deposit clients into buying other products such as funds or insurance to further increase their revenues. As a result, the Portuguese bank would like to identify existing clients that have higher chance to subscribe for a term deposit and focus marketing effort on such clients.
A classification approach to predict which clients are more likely to subscribe for term deposits.
library(ggplot2)
library(caret)
library(caretEnsemble)
library(ROSE)
library(mlbench)
library(DMwR)
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(tidyverse)
library(stringr)
#Load the dataset
#url<-"https://raw.githubusercontent.com/nitishghosal/Measuring-Campaign-Effectiveness/master/bank-full.csv"
#mydata <- as_data_frame(read.csv(url,stringsAsFactors = FALSE,na="NA"))
mydata <- read.csv("bank-full.csv")
#Summary on dataset
summary(mydata)
## age job marital education
## Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851
## 1st Qu.:33.00 management :9458 married :27214 secondary:23202
## Median :39.00 technician :7597 single :12790 tertiary :13301
## Mean :40.94 admin. :5171 unknown : 1857
## 3rd Qu.:48.00 services :4154
## Max. :95.00 retired :2264
## (Other) :6835
## default balance housing loan contact
## no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
## yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
## Median : 448 unknown :13020
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## day month duration campaign
## Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
## Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
## (Other): 6060
## pdays previous poutcome y
## Min. : -1.0 Min. : 0.0000 failure: 4901 no :39922
## 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 yes: 5289
## Median : -1.0 Median : 0.0000 success: 1511
## Mean : 40.2 Mean : 0.5803 unknown:36959
## 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :871.0 Max. :275.0000
##
str(mydata)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Next, we will visualize the relationship between each variable versus y (whether or not a client has subscribed to term deposit) through box chart. We found out that only duration has significance among all other variables.
## Understanding relation among important variables through visualizations
p_age <- ggplot(mydata, aes(factor(y), age)) + geom_boxplot(aes(fill = factor(y)))
p_age
p_balance <- ggplot(mydata, aes(factor(y), balance)) + geom_boxplot(aes(fill = factor(y)))
p_balance
p_day <- ggplot(mydata, aes(factor(y), day)) + geom_boxplot(aes(fill = factor(y)))
p_day
p_duration <- ggplot(mydata, aes(factor(y), duration)) + geom_boxplot(aes(fill = factor(y)))
p_duration
p_campaign <- ggplot(mydata, aes(factor(y), campaign)) + geom_boxplot(aes(fill = factor(y)))
p_campaign
p_pdays <- ggplot(mydata, aes(factor(y), pdays)) + geom_boxplot(aes(fill = factor(y)))
p_pdays
p_previous <- ggplot(mydata, aes(factor(y), previous)) + geom_boxplot(aes(fill = factor(y)))
p_previous
We will create dummy variables for our categorical variables.
#Generate dummy variables
for(level in unique(mydata$job)){
mydata[paste("job", level, sep = "_")] <- ifelse(mydata$job == level, 1, 0)
}
for(level in unique(mydata$marital)){
mydata[paste("marital", level, sep = "_")] <- ifelse(mydata$marital == level, 1, 0)
}
for(level in unique(mydata$education)){
mydata[paste("education", level, sep = "_")] <- ifelse(mydata$education == level, 1, 0)
}
mydata$default_yes <- ifelse(mydata$default == "yes", 1, 0)
mydata$housing_yes <- ifelse(mydata$housing == "yes", 1, 0)
mydata$loan_yes <- ifelse(mydata$loan == "yes", 1, 0)
for(level in unique(mydata$contact)){
mydata[paste("contact", level, sep = "_")] <- ifelse(mydata$contact == level, 1, 0)
}
for(level in unique(mydata$month)){
mydata[paste("month", level, sep = "_")] <- ifelse(mydata$month == level, 1, 0)
}
for(level in unique(mydata$poutcome)){
mydata[paste("poutcome", level, sep = "_")] <- ifelse(mydata$poutcome == level, 1, 0)
}
mydata$Class <- ifelse(mydata$y == "yes", "Yes", "No")
#Remove unwanted columns
mydata$X <- NULL
mydata$job <- NULL
mydata$marital <- NULL
mydata$education <- NULL
mydata$default <- NULL
mydata$housing <- NULL
mydata$loan <- NULL
mydata$contact <- NULL
mydata$month <- NULL
mydata$poutcome <- NULL
mydata$y <- NULL
mydata$Class <- as.factor((mydata$Class))
colnames(mydata)[11] <- "job_blue_collar"
colnames(mydata)[14] <- "job_admin"
colnames(mydata)[16] <- "job_self_employeed"
We split the data into training and testing samples and perform scaling.Since the variables are of different magnitude, scaling is recommended, though tree-based models do not usually require scaling to achieve good performance.
#Splitting
set.seed(1)
training_size <- floor(0.80 * nrow(mydata))
train_ind <- sample(seq_len(nrow(mydata)), size = training_size)
training <- mydata[train_ind, ]
testing <- mydata[-train_ind, ]
#Normalizing
preProcValues <- preProcess(training, method = c("center", "scale"))
scaled.training <- predict(preProcValues, training)
scaled.testing <- predict(preProcValues, testing)
We can observe that y is highly scewed. That is, we have a lot of non-subscribers; subscribers only account for about 11.7% of all clients. As a result, sampling is required to handle the imbalanced classifier. Here we will demonstrate 4 popular resampling techniques, including under-sampling, over-sampling, smote sampling, and rose sampling on the training data set.
#Sampling
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary)
set.seed(2)
down_training <- downSample(x = scaled.training[, -ncol(scaled.training)],
y = scaled.training$Class)
up_training <- upSample(x = scaled.training[, -ncol(scaled.training)],
y = scaled.training$Class)
smote_training <- SMOTE(Class~., data = scaled.training)
rose_training <- ROSE(Class~., data = scaled.training, seed=2)$data
We will train the model with CART (classification and regression tree) as we would like more interpretation on the model instead of predictive power. Here we will use cross validation to train the model, and instead of accuracy, we will also use ROC as the evaluation metric (the closer to 1 the better).
#Model training - CART
set.seed(3)
orig_fit <- train(Class~., data = training,
method = "rpart",
metric = "ROC",
trControl = ctrl)
set.seed(4)
down_outside <- train(Class~., data = down_training,
method = "rpart",
metric = "ROC",
trControl = ctrl)
set.seed(5)
up_outside <- train(Class~., data = up_training,
method = "rpart",
metric = "ROC",
trControl = ctrl)
set.seed(6)
smote_outside <- train(Class~., data = smote_training,
method = "rpart",
metric = "ROC",
trControl = ctrl)
set.seed(7)
rose_outside <- train(Class~., data = rose_training,
method = "rpart",
metric = "ROC",
trControl = ctrl)
#Model testing - Original
original_model <- list(original = orig_fit)
test_roc <- function(model, data) {
library(pROC)
roc_obj <- roc(data$Class,
predict(model, data, type = "prob")[, "Yes"],
levels = c("No", "Yes"))
ci(roc_obj)
}
original_test <- lapply(original_model, test_roc, data = testing)
original_test <- lapply(original_test, as.vector)
original_test <- do.call("rbind", original_test)
colnames(original_test) <- c("lower", "ROC", "upper")
original_test <- as.data.frame(original_test)
#Model testing - Resampled
scaled_models <- list(down = down_outside,
up = up_outside,
SMOTE = smote_outside,
ROSE = rose_outside)
scaled_test <- lapply(scaled_models, test_roc, data = scaled.testing)
scaled_test <- lapply(scaled_test, as.vector)
scaled_test <- do.call("rbind", scaled_test)
colnames(scaled_test) <- c("lower", "ROC", "upper")
scaled_test <- as.data.frame(scaled_test)
cart_test <- rbind(original_test,scaled_test)
After training the models on resampled data sets, we will examine the performance, in terms of ROC. It is observed that over-sampling (up) has the best performance, of mean ROC at 0.8078530.
cart_test
## lower ROC upper
## original 0.7452805 0.7608143 0.7763481
## down 0.7932395 0.8065504 0.8198612
## up 0.7932395 0.8065504 0.8198612
## SMOTE 0.7508949 0.7639485 0.7770021
## ROSE 0.5791397 0.5912398 0.6033399
We will visualize and interpret the CART model. It is observed that indeed duration has the most impact on subscriptions.
fancyRpartPlot(up_outside$finalModel)
After converting the scaled duration back to original, we know the threshold call time is 473 seconds, or approximately 8 minutes. In other words, if a call lasts longer than 8 minutes, the chance of subscribing is at 84%. In addition, if a call is between 205 seconds and 473 seconds, and the contact method is unknown, then the chance of subscribing is 64%.