Classifying Whether or Not a Customer Opted for Term Deposit
Background
In this classification analysis, we would like to see whether a bank customer opts for the Term Deposit product after a marketing campaign has been conducted. There are 45,211 customers observed in this analysis, coming from various backgrounds, demographics, and profiles. They were contacted by telemarketing agents who offered them the Term Deposit product.
The goal of this analysis is to determine which customer profiles are suitable for the Term Deposit product. That way, we can save the resources spent on marketing campaigns and telemarketing activities by focusing only on the targeted customers.
Exploratory Data Analysis
Data Checking
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin","blue",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ deposit : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## age job marital education
## Min. :18.00 blue :9732 divorced: 5207 primary : 6851
## 1st Qu.:33.00 management:9458 married :27214 secondary:23202
## Median :39.00 technician:7597 single :12790 tertiary :13301
## Mean :40.94 admin :5171 unknown : 1857
## 3rd Qu.:48.00 services :4154
## Max. :95.00 retired :2264
## (Other) :6835
## default balance housing loan contact
## no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
## yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
## Median : 448 unknown :13020
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## day month duration campaign
## Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
## Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
## (Other): 6060
## pdays previous poutcome deposit
## Min. : -1.0 Min. : 0.0000 failure: 4901 No :39922
## 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 Yes: 5289
## Median : -1.0 Median : 0.0000 success: 1511
## Mean : 40.2 Mean : 0.5803 unknown:36959
## 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :871.0 Max. :275.0000
##
The explanation of each variable:
- age : Age of the customer (numeric)
- job : Type of job (categorical)
- marital : Marital status (categorical)
- education : Level of education of the customer (categorical)
- default : Has credit in default? "yes" if the customer has defaulted
- balance : Average yearly balance, in euros (numeric)
- housing : Has a housing loan? "yes" if the customer has a housing loan
- loan : Has a personal loan? "yes" if the customer has a personal loan
- contact : Contact communication type (categorical)
- day : Last contact day of the month (numeric)
- month : Last contact month of year (categorical)
- duration : Last contact duration, in seconds (numeric) - important variable
- campaign : Number of contacts performed during this campaign and for this client (numeric, includes last contact)
- pdays : Number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
- previous : Number of contacts performed before this campaign and for this client (numeric)
- poutcome : Outcome of the previous marketing campaign (categorical)
- deposit : Has the client subscribed to a term deposit? "yes" if the customer has subscribed. This is the target variable, with "yes" as the "positive" class.
Feature Selection from a Business Perspective
After taking a look at the data structure and summary, we will adjust and/or eliminate some of the data under the following conditions:
- In the job variable, the "entrepreneur" and "self" categories will be merged into "entrepreneur", and "retired" and "unemployed" will be merged into "unemployed". We will also eliminate the "unknown" category, since it does not give us much of a clue and its count is insignificant.
- In the education variable, we will eliminate the "unknown" category, since it does not give us much of a clue and its count is insignificant.
- The contact variable will be eliminated because there are too many "unknown" values, and we will assume all customers were contacted by telemarketers, regardless of the device.
- We will eliminate the day and month variables, since it should make no difference whether the telemarketer contacts a customer on a certain date or in a certain month.
- We will remove the poutcome variable, because there are too many "unknown" values. Thus, we assume that every customer was exposed to the marketing campaign and that the previous campaign did not have a significant effect.
- We will remove pdays and previous, since their distributions are highly right-skewed and will not give us much information.
The Final Data for Modelling:
bank_final <- bank %>%
  mutate(job = replace(job, job == "self", "entrepreneur"),
         job = replace(job, job == "retired", "unemployed")) %>%
  filter(job != "unknown",
         education != "unknown") %>%
  select(-contact, -day, -month, -poutcome, -pdays, -previous) %>%
  droplevels()

str(bank_final)
## 'data.frame': 43193 obs. of 11 variables:
## $ age : int 58 44 33 35 28 42 58 43 41 29 ...
## $ job : Factor w/ 9 levels "admin","blue",..: 5 8 3 5 5 3 9 8 1 1 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 1 2 3 1 3 ...
## $ education: Factor w/ 3 levels "primary","secondary",..: 3 2 2 3 3 3 1 2 2 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ balance : int 2143 29 2 231 447 2 121 593 270 390 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 2 1 1 1 1 1 ...
## $ duration : int 261 151 76 139 217 380 50 55 222 137 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ deposit : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## age job marital education
## Min. :18.00 blue :9278 divorced: 5028 primary : 6800
## 1st Qu.:33.00 management:9216 married :25946 secondary:23131
## Median :39.00 technician:7355 single :12219 tertiary :13262
## Mean :40.76 admin :5000
## 3rd Qu.:48.00 services :4004
## Max. :95.00 unemployed:3419
## (Other) :4921
## default balance housing loan duration
## no :42411 Min. : -8019 no :18901 no :36086 Min. : 0.0
## yes: 782 1st Qu.: 71 yes:24292 yes: 7107 1st Qu.: 103.0
## Median : 442 Median : 180.0
## Mean : 1354 Mean : 258.3
## 3rd Qu.: 1412 3rd Qu.: 318.0
## Max. :102127 Max. :4918.0
##
## campaign deposit
## Min. : 1.000 No :38172
## 1st Qu.: 1.000 Yes: 5021
## Median : 2.000
## Mean : 2.758
## 3rd Qu.: 3.000
## Max. :58.000
##
Data Preprocessing
Splitting Data into Modelling and Validation Sets
set.seed(289)
idx <- initial_split(data = bank_final, prop = 0.8, strata = "deposit")
bank_model <- training(idx) # 80% for modelling
bank_valid <- testing(idx)  # 20% for validation

prop.table(table(bank_model$deposit))
## 
##        No       Yes 
## 0.8855158 0.1144842
prop.table(table(bank_valid$deposit))
## 
##        No       Yes 
## 0.8767076 0.1232924
dim(bank_model)
## [1] 34555    11
dim(bank_valid)
## [1] 8638    11
Splitting Modelling Data into Training and Testing Sets
set.seed(289)
idx1 <- initial_split(data = bank_model, prop = 0.7, strata = "deposit")
bank_train <- training(idx1) # 70% for training
bank_test <- testing(idx1)   # 30% for testing

prop.table(table(bank_train$deposit))
## 
##        No       Yes 
## 0.8858572 0.1141428
prop.table(table(bank_test$deposit))
## 
##        No       Yes 
## 0.8847193 0.1152807
dim(bank_train)
## [1] 24189    11
dim(bank_test)
## [1] 10366    11
Balancing Training Dataset
Since the proportion of customers who opted for the Term Deposit ("Yes") versus those who did not ("No") is highly imbalanced (roughly 11:88), we are going to balance the training dataset.
train_bank.new <- downSample(x = bank_train[ , -11],
                             y = bank_train$deposit,
                             yname = "deposit")

prop.table(table(train_bank.new$deposit))
## 
##  No Yes 
## 0.5 0.5
Modelling
There are three classification models that are going to be used in this analysis: Naive Bayes, Decision Tree, and Random Forest. But before that, let's set up the k-fold cross-validation.
k-Fold Cross Validation
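The trainControl object (ctrl) passed to every train() call below does not appear in the extracted output. A minimal sketch of how it could be defined, assuming 5-fold cross-validation (the number of folds is an assumption, not stated in the source):
library(caret)

# cross-validation control reused by all train() calls below;
# 5 folds is an illustrative assumption
ctrl <- trainControl(method = "cv", number = 5)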
ROC and AUC
Since we are not relying only on the results given by the Confusion Matrix, we are going to make a function to plot the ROC curve and calculate the AUC for all of our models.
visTreeROC <- function(model, data, outcome){
  # Core three steps from ROCR
  temp_pred <- predict(model, newdata = data, type = "prob")
  roc_pred <- prediction(temp_pred[, 2], data[, outcome])
  roc_perf <- performance(roc_pred, "tpr", "fpr")
  # make the plot
  plot(roc_perf, colorize = TRUE,
       main = "ROC Curve",
       lwd = 2)
  abline(coef = c(0, 1),
         col = "black",
         lwd = 2)
  # Area Under the Curve
  text(x = .6,
       y = .3,
       label = paste("Area Under the Curve:\n",
                     round(as.numeric(performance(roc_pred, "auc")@y.values), 5)))
}

Naive Bayes Model
Without Tuning
bank_bayes <- train(deposit ~ ., train_bank.new, method = "naive_bayes", trControl = ctrl)
pred_bayes <- predict(bank_bayes, bank_test)

# Confusion Matrix
confusionMatrix(pred_bayes, bank_test$deposit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6951 262
## Yes 2220 933
##
## Accuracy : 0.7606
## 95% CI : (0.7522, 0.7688)
## No Information Rate : 0.8847
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3146
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7579
## Specificity : 0.7808
## Pos Pred Value : 0.9637
## Neg Pred Value : 0.2959
## Prevalence : 0.8847
## Detection Rate : 0.6706
## Detection Prevalence : 0.6958
## Balanced Accuracy : 0.7693
##
## 'Positive' Class : No
##
As we can see in the Confusion Matrix, the Accuracy level is not particularly strong, even though it is not a bad number. The Sensitivity is also not satisfying. Since we are going to target customers who are predicted to opt for the Term Deposit (Positive/"Yes"), we would like to have higher numbers.
With Tuning
Let's tune the previous Naive Bayes model by setting laplace to 1, usekernel to TRUE, and the bandwidth adjust to 1.
bayes_grid <- expand.grid(laplace = 1, usekernel = TRUE, adjust = 1)
bank_bayes_imp <- train(deposit ~ ., train_bank.new, method = "naive_bayes", trControl = ctrl, tuneGrid = bayes_grid)

Plot Important Variables
The most important variable based on our improved model is duration. This makes sense, since customers who are interested in and willing to take the Term Deposit spend more time in the telemarketing session, because they want to know more about the details.
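The variable importance plot itself does not survive the extraction. It was presumably produced with caret's varImp interface, along these lines (a sketch, not necessarily the author's exact call):
# variable importance of the tuned Naive Bayes model
plot(varImp(bank_bayes_imp), main = "Variable Importance - Tuned Naive Bayes")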
Confusion Matrix of Tuned Naive Bayes Model
pred_bayes_imp <- predict(bank_bayes_imp, bank_test)

confusionMatrix(pred_bayes_imp, bank_test$deposit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6951 262
## Yes 2220 933
##
## Accuracy : 0.7606
## 95% CI : (0.7522, 0.7688)
## No Information Rate : 0.8847
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3146
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7579
## Specificity : 0.7808
## Pos Pred Value : 0.9637
## Neg Pred Value : 0.2959
## Prevalence : 0.8847
## Detection Rate : 0.6706
## Detection Prevalence : 0.6958
## Balanced Accuracy : 0.7693
##
## 'Positive' Class : No
##
Unfortunately, there is no difference between the tuned model and the previous one. The imbalanced proportion of deposit customers in the Test dataset may be the reason the model does not reach the expected result.
ROC and AUC of Tuned Naive Bayes Model
The good thing is that the ROC curve is quite promising and scores a good AUC: 0.8409. We will keep this model and compare it with the others.
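The ROC plot referenced here comes from the helper defined earlier; the call would presumably look like this (a sketch, since the original chunk is not shown):
visTreeROC(bank_bayes_imp, bank_test, "deposit")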
Decision Tree Model
Without Tuning
bank_tree <- train(deposit ~ ., train_bank.new, method = "ctree", trControl = ctrl)
plot(bank_tree$finalModel, type = "simple")

pred_tree <- predict(bank_tree, bank_test)
# Confusion Matrix
confusionMatrix(pred_tree, bank_test$deposit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6684 202
## Yes 2487 993
##
## Accuracy : 0.7406
## 95% CI : (0.732, 0.749)
## No Information Rate : 0.8847
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3056
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7288
## Specificity : 0.8310
## Pos Pred Value : 0.9707
## Neg Pred Value : 0.2853
## Prevalence : 0.8847
## Detection Rate : 0.6448
## Detection Prevalence : 0.6643
## Balanced Accuracy : 0.7799
##
## 'Positive' Class : No
##
The model plot is quite easy to interpret. The most significant variable is duration, the same finding as in the Naive Bayes model, followed by the housing loan and personal loan variables. This makes sense, because customers who have already taken a loan product (housing/personal) will find it easier to understand and accept another product.
Unfortunately, the Accuracy and Sensitivity of this model score lower than the Naive Bayes model. Let's try to tune it.
With Tuning
Let us set the mincriterion to 0.95.
tree_grid <- expand.grid(mincriterion = 0.95)
bank_tree_imp <- train(deposit ~ ., train_bank.new, method = "ctree", trControl = ctrl, tuneGrid = tree_grid)

Plot Tuned Decision Tree
As expected, when the mincriterion is set to 0.95, the tree grows larger with more nodes. However, it does not give us much different information than the previous models.
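The tuned tree plot itself is not included in the extracted output; presumably it was drawn the same way as for the untuned model, for example:
plot(bank_tree_imp$finalModel, type = "simple")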
Confusion Matrix of Tuned Decision Tree Model
pred_tree_imp <- predict(bank_tree_imp, bank_test)

confusionMatrix(pred_tree_imp, bank_test$deposit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6760 209
## Yes 2411 986
##
## Accuracy : 0.7473
## 95% CI : (0.7388, 0.7556)
## No Information Rate : 0.8847
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3121
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7371
## Specificity : 0.8251
## Pos Pred Value : 0.9700
## Neg Pred Value : 0.2903
## Prevalence : 0.8847
## Detection Rate : 0.6521
## Detection Prevalence : 0.6723
## Balanced Accuracy : 0.7811
##
## 'Positive' Class : No
##
The Accuracy and Sensitivity of the tuned model improve, but they are still not better than the Naive Bayes model.
ROC and AUC of Tuned Decision Tree Model
The ROC curve looks good, and the AUC scores slightly better than Naive Bayes. But since the Accuracy and Sensitivity are lower, we will not use this model.
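Again, the ROC/AUC plot would come from the same helper, presumably along these lines:
visTreeROC(bank_tree_imp, bank_test, "deposit")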
Random Forest Model
Without Tuning
bank_forest <- train(deposit ~ ., data = train_bank.new, method = "rf", trControl = ctrl)
plot(bank_forest$finalModel)
legend("topright", colnames(bank_forest$finalModel$err.rate), col = 1:6, cex = 0.8, fill = 1:6)

# Confusion Matrix
pred_forest <- predict(bank_forest, bank_test)
confusionMatrix(pred_forest, bank_test$deposit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 7024 213
## Yes 2147 982
##
## Accuracy : 0.7723
## 95% CI : (0.7641, 0.7804)
## No Information Rate : 0.8847
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3449
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7659
## Specificity : 0.8218
## Pos Pred Value : 0.9706
## Neg Pred Value : 0.3138
## Prevalence : 0.8847
## Detection Rate : 0.6776
## Detection Prevalence : 0.6981
## Balanced Accuracy : 0.7938
##
## 'Positive' Class : No
##
With Tuning
Let us try to reduce ntree to 200, since based on the previous plot we can see that after 200 trees there is no significant difference in the errors.
bank_forest_imp <- train(deposit ~ ., data = train_bank.new, method = "rf", trControl = ctrl, ntree = 200)

Plotting The Model
plot(bank_forest_imp$finalModel)
legend("topright", colnames(bank_forest_imp$finalModel$err.rate), col = 1:6, cex = 0.8, fill = 1:6)

As we can see, the final model plot still gives the same reading. And the varImp plot shows duration as the most important variable, similar to the Naive Bayes and Decision Tree models.
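The varImp plot mentioned above is not shown in the extracted output; a sketch of how it could be produced with caret:
# variable importance of the tuned Random Forest model
plot(varImp(bank_forest_imp), main = "Variable Importance - Tuned Random Forest")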
Confusion Matrix of Tuned Random Forest Model
pred_forest_imp <- predict(bank_forest_imp, bank_test)

confusionMatrix(pred_forest_imp, bank_test$deposit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 7019 218
## Yes 2152 977
##
## Accuracy : 0.7714
## 95% CI : (0.7632, 0.7794)
## No Information Rate : 0.8847
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3421
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7653
## Specificity : 0.8176
## Pos Pred Value : 0.9699
## Neg Pred Value : 0.3122
## Prevalence : 0.8847
## Detection Rate : 0.6771
## Detection Prevalence : 0.6981
## Balanced Accuracy : 0.7915
##
## 'Positive' Class : No
##
The Accuracy reading is slightly lower than the previous Random Forest model, and the Sensitivity is essentially unchanged. Compared to Naive Bayes, however, the tuned Random Forest model scores better on both Accuracy and Sensitivity.
ROC and AUC of Tuned Random Forest Model
The ROC plot looks good, and the AUC score is the highest when compared to Naive Bayes (and even the Decision Tree). We will use this model to predict the validation data.
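As with the other models, the plot would presumably come from the same helper:
visTreeROC(bank_forest_imp, bank_test, "deposit")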
Ultimate Test
And so we come to the final test: we would like to see whether the selected model, Random Forest, gives consistent predictions when applied to the validation dataset.
pred_forest_valid <- predict(bank_forest_imp, bank_valid)

# Confusion Matrix
confusionMatrix(pred_forest_valid, bank_valid$deposit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 5751 176
## Yes 1822 889
##
## Accuracy : 0.7687
## 95% CI : (0.7597, 0.7776)
## No Information Rate : 0.8767
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.357
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7594
## Specificity : 0.8347
## Pos Pred Value : 0.9703
## Neg Pred Value : 0.3279
## Prevalence : 0.8767
## Detection Rate : 0.6658
## Detection Prevalence : 0.6862
## Balanced Accuracy : 0.7971
##
## 'Positive' Class : No
##
The Accuracy and Sensitivity readings are lower compared to when the model was applied to the Test dataset. Still, the decrease is not significant, so we will use this model to predict whether or not a customer will opt for the Term Deposit product.
Conclusion
- Some of the independent variables are not useful in the prediction model. Some of them are irrelevant, and some are incomplete (missing/unknown values). The result could be quite different if we had more complete values.
- The proportion of customers who opted for the Term Deposit versus those who did not is very imbalanced (around 10:90). This highly imbalanced proportion may lead to low readings in the Confusion Matrix analysis, because we only balanced the training dataset, not the testing or validation datasets.
- The most important variable is the last contact duration between the telemarketing agent and the customer. As said before, this is plausible, because customers who are willing to take the Term Deposit product tend to have longer conversations.
- The Accuracy and Sensitivity readings of all the models are not significantly different, but the one that gives the better Accuracy reading is Random Forest.
- The ROC plots of the three models, and their AUC scores, are also not significantly different. The highest AUC is given by the Random Forest model, therefore we will use Random Forest as the model for this prediction analysis.