Classifying Whether or Not a Customer Opted for Term Deposit
Background
In this classification analysis, we would like to see whether a bank customer opts for the Term Deposit product after a marketing campaign has been conducted. There are 45,211 customers observed in this analysis, coming from various backgrounds, demographics, and profiles. They were contacted by telemarketing agents who offered them the Term Deposit product.
The goal of this analysis is to determine which customer profiles are suitable for the Term Deposit product. That way, we can save the resources spent on marketing campaigns and telemarketing activities by focusing only on the targeted customers.
Exploratory Data Analysis
Data Checking
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin","blue",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ deposit : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## age job marital education
## Min. :18.00 blue :9732 divorced: 5207 primary : 6851
## 1st Qu.:33.00 management:9458 married :27214 secondary:23202
## Median :39.00 technician:7597 single :12790 tertiary :13301
## Mean :40.94 admin :5171 unknown : 1857
## 3rd Qu.:48.00 services :4154
## Max. :95.00 retired :2264
## (Other) :6835
## default balance housing loan contact
## no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
## yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
## Median : 448 unknown :13020
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## day month duration campaign
## Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
## Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
## (Other): 6060
## pdays previous poutcome deposit
## Min. : -1.0 Min. : 0.0000 failure: 4901 No :39922
## 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 Yes: 5289
## Median : -1.0 Median : 0.0000 success: 1511
## Mean : 40.2 Mean : 0.5803 unknown:36959
## 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :871.0 Max. :275.0000
##
The explanation of each variable:
- age : Age of the customer (numeric)
- job : Type of job (categorical)
- marital : Marital status (categorical)
- education : Level of education of the customer (categorical)
- default : Has credit in default? "yes" if the customer has defaulted
- balance : Average yearly balance, in euros (numeric)
- housing : Has a housing loan? "yes" if the customer has a housing loan
- loan : Has a personal loan? "yes" if the customer has a personal loan
- contact : Contact communication type (categorical)
- day : Last contact day of the month (numeric)
- month : Last contact month of year (categorical)
- duration : Last contact duration, in seconds (numeric) - important variable
- campaign : Number of contacts performed during this campaign and for this client (numeric, includes last contact)
- pdays : Number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
- previous : Number of contacts performed before this campaign and for this client (numeric)
- poutcome : Outcome of the previous marketing campaign (categorical)
- deposit : Has the client subscribed to a term deposit? "yes" if the customer has subscribed. This is the target variable, with "yes" as the "positive" class.
Feature Selection from a Business Perspective
After taking a look at the data structure and summary, we will adjust and/or eliminate some of the data under the following conditions:
- In the job variable, the "entrepreneur" and "self" categories will be merged into "entrepreneur", and "retired" and "unemployed" will be merged into "unemployed". We will also eliminate the "unknown" category, since it does not give us much of a clue and its count is insignificant.
- In the education variable, we will eliminate the "unknown" category, since it does not give us much of a clue and its count is insignificant.
- The contact variable will be eliminated because there are too many "unknown" values, and we will assume all customers were contacted by telemarketers, regardless of the device.
- We will eliminate the day and month variables, since it should make no difference whether the telemarketer contacts a customer on a certain date or in a certain month.
- We will remove the poutcome variable, because there are too many "unknown" values. Thus, we assume that every customer was exposed to the marketing campaign and that the previous campaign did not have a significant effect.
- We will remove pdays and previous, since their distributions are highly right-skewed and will not give us much information.
The Final Data for Modelling:
bank_final <- bank %>%
  mutate(job = replace(job, job == "self", "entrepreneur"),
         job = replace(job, job == "retired", "unemployed")) %>%
  filter(job != "unknown",
         education != "unknown") %>%
  select(-contact, -day, -month, -poutcome, -pdays, -previous) %>%
  droplevels()

str(bank_final)
## 'data.frame': 43193 obs. of 11 variables:
## $ age : int 58 44 33 35 28 42 58 43 41 29 ...
## $ job : Factor w/ 9 levels "admin","blue",..: 5 8 3 5 5 3 9 8 1 1 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 1 2 3 1 3 ...
## $ education: Factor w/ 3 levels "primary","secondary",..: 3 2 2 3 3 3 1 2 2 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ balance : int 2143 29 2 231 447 2 121 593 270 390 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 2 1 1 1 1 1 ...
## $ duration : int 261 151 76 139 217 380 50 55 222 137 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ deposit : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## age job marital education
## Min. :18.00 blue :9278 divorced: 5028 primary : 6800
## 1st Qu.:33.00 management:9216 married :25946 secondary:23131
## Median :39.00 technician:7355 single :12219 tertiary :13262
## Mean :40.76 admin :5000
## 3rd Qu.:48.00 services :4004
## Max. :95.00 unemployed:3419
## (Other) :4921
## default balance housing loan duration
## no :42411 Min. : -8019 no :18901 no :36086 Min. : 0.0
## yes: 782 1st Qu.: 71 yes:24292 yes: 7107 1st Qu.: 103.0
## Median : 442 Median : 180.0
## Mean : 1354 Mean : 258.3
## 3rd Qu.: 1412 3rd Qu.: 318.0
## Max. :102127 Max. :4918.0
##
## campaign deposit
## Min. : 1.000 No :38172
## 1st Qu.: 1.000 Yes: 5021
## Median : 2.000
## Mean : 2.758
## 3rd Qu.: 3.000
## Max. :58.000
##
Data Preprocessing
Splitting Data into Modelling and Validation Sets
set.seed(289)
idx <- initial_split(data = bank_final, prop = 0.8, strata = "deposit")
bank_model <- training(idx) # 80% for modelling
bank_valid <- testing(idx)  # 20% for validation

prop.table(table(bank_model$deposit))
## 
##        No       Yes 
## 0.8855158 0.1144842
prop.table(table(bank_valid$deposit))
## 
##        No       Yes 
## 0.8767076 0.1232924
dim(bank_model)
## [1] 34555    11
dim(bank_valid)
## [1] 8638    11
Splitting Modelling Data into Training and Testing Sets
set.seed(289)
idx1 <- initial_split(data = bank_model, prop = 0.7, strata = "deposit")
bank_train <- training(idx1) # 70% for training
bank_test <- testing(idx1)   # 30% for testing

prop.table(table(bank_train$deposit))
## 
##        No       Yes 
## 0.8858572 0.1141428
prop.table(table(bank_test$deposit))
## 
##        No       Yes 
## 0.8847193 0.1152807
dim(bank_train)
## [1] 24189    11
dim(bank_test)
## [1] 10366    11
Balancing Training Dataset
Since the proportion of customers who opted for the Term Deposit ("Yes") versus those who did not ("No") is highly imbalanced (roughly 11:88), we are going to balance the training dataset.
train_bank.new <- downSample(x = bank_train[ , -11],
                             y = bank_train$deposit,
                             yname = "deposit")

prop.table(table(train_bank.new$deposit))
## 
##  No Yes 
## 0.5 0.5
Modelling
There are three classification models that are going to be used in this analysis: Naive Bayes, Decision Tree, and Random Forest. But before that, let's set up the k-fold cross-validation.
k-Fold Cross Validation
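The trainControl object (ctrl) passed to every train() call below does not appear in the extracted output. A minimal sketch of how it could be defined, assuming 5-fold cross-validation (the number of folds is an assumption, not stated in the source):
library(caret)

# cross-validation control reused by all train() calls below;
# 5 folds is an illustrative assumption
ctrl <- trainControl(method = "cv", number = 5)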
ROC and AUC
Since we are not relying only on the results given by the Confusion Matrix, we are going to make a function to plot the ROC curve and calculate the AUC for all of our models.
visTreeROC <- function(model, data, outcome){
  # Core three steps from ROCR
  temp_pred <- predict(model, newdata = data, type = "prob")
  roc_pred <- prediction(temp_pred[, 2], data[, outcome])
  roc_perf <- performance(roc_pred, "tpr", "fpr")
  # make the plot
  plot(roc_perf, colorize = TRUE,
       main = "ROC Curve",
       lwd = 2)
  abline(coef = c(0, 1),
         col = "black",
         lwd = 2)
  # Area Under the Curve
  text(x = .6,
       y = .3,
       label = paste("Area Under the Curve:\n",
                     round(as.numeric(performance(roc_pred, "auc")@y.values), 5)))
}

Naive Bayes Model
Without Tuning
bank_bayes <- train(deposit ~ ., train_bank.new, method = "naive_bayes", trControl = ctrl)
pred_bayes <- predict(bank_bayes, bank_test)

# Confusion Matrix
confusionMatrix(pred_bayes, bank_test$deposit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6951 262
## Yes 2220 933
##
## Accuracy : 0.7606
## 95% CI : (0.7522, 0.7688)
## No Information Rate : 0.8847
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3146
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7579
## Specificity : 0.7808
## Pos Pred Value : 0.9637
## Neg Pred Value : 0.2959
## Prevalence : 0.8847
## Detection Rate : 0.6706
## Detection Prevalence : 0.6958
## Balanced Accuracy : 0.7693
##
## 'Positive' Class : No
##
As we can see in the Confusion Matrix, the Accuracy level is not particularly strong, even though it is not a bad number. The Sensitivity is also not satisfying. Since we are going to target customers who are predicted to opt for the Term Deposit (Positive/"Yes"), we would like to have higher numbers.
With Tuning
Let's tune the previous Naive Bayes model by setting laplace to 1, usekernel to TRUE, and the bandwidth adjust to 1.
bayes_grid <- expand.grid(laplace = 1, usekernel = TRUE, adjust = 1)
bank_bayes_imp <- train(deposit ~ ., train_bank.new, method = "naive_bayes", trControl = ctrl, tuneGrid = bayes_grid)

Plot Important Variables
The most important variable based on our improved model is duration. This makes sense, since customers who are interested in and willing to take the Term Deposit spend more time in the telemarketing session, because they want to know more about the details.
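The variable importance plot itself does not survive the extraction. It was presumably produced with caret's varImp interface, along these lines (a sketch, not necessarily the author's exact call):
# variable importance of the tuned Naive Bayes model
plot(varImp(bank_bayes_imp), main = "Variable Importance - Tuned Naive Bayes")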
Confusion Matrix of Tuned Naive Bayes Model
pred_bayes_imp <- predict(bank_bayes_imp, bank_test)

confusionMatrix(pred_bayes_imp, bank_test$deposit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6951 262
## Yes 2220 933
##
## Accuracy : 0.7606
## 95% CI : (0.7522, 0.7688)
## No Information Rate : 0.8847
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3146
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7579
## Specificity : 0.7808
## Pos Pred Value : 0.9637
## Neg Pred Value : 0.2959
## Prevalence : 0.8847
## Detection Rate : 0.6706
## Detection Prevalence : 0.6958
## Balanced Accuracy : 0.7693
##
## 'Positive' Class : No
##
Unfortunately, there is no difference between the tuned model and the previous one. The imbalanced proportion of deposit customers in the Test dataset may be the reason the model does not reach the expected result.
ROC and AUC of Tuned Naive Bayes Model
The good thing is that the ROC curve is quite promising and scores a good AUC: 0.8409. We will keep this model and compare it with the others.
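The ROC plot referenced here comes from the helper defined earlier; the call would presumably look like this (a sketch, since the original chunk is not shown):
visTreeROC(bank_bayes_imp, bank_test, "deposit")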
Decision Tree Model
Without Tuning
bank_tree <- train(deposit ~ ., train_bank.new, method = "ctree", trControl = ctrl)
plot(bank_tree$finalModel, type = "simple")

pred_tree <- predict(bank_tree, bank_test)
# Confusion Matrix
confusionMatrix(pred_tree, bank_test$deposit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6684 202
## Yes 2487 993
##
## Accuracy : 0.7406
## 95% CI : (0.732, 0.749)
## No Information Rate : 0.8847
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3056
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7288
## Specificity : 0.8310
## Pos Pred Value : 0.9707
## Neg Pred Value : 0.2853
## Prevalence : 0.8847
## Detection Rate : 0.6448
## Detection Prevalence : 0.6643
## Balanced Accuracy : 0.7799
##
## 'Positive' Class : No
##
The model plot is quite easy to interpret. The most significant variable is duration, the same finding as in the Naive Bayes model, followed by the housing loan and personal loan variables. This makes sense, because customers who have already taken a loan product (housing/personal) will find it easier to understand and accept another product.
Unfortunately, the Accuracy and Sensitivity of this model score lower than the Naive Bayes model. Let's try to tune it.
With Tuning
Let us set the mincriterion to 0.95.
tree_grid <- expand.grid(mincriterion = 0.95)
bank_tree_imp <- train(deposit ~ ., train_bank.new, method = "ctree", trControl = ctrl, tuneGrid = tree_grid)

Plot Tuned Decision Tree
As expected, when the mincriterion is set to 0.95, the tree grows larger with more nodes. However, it does not give us much different information than the previous models.
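The tuned tree plot itself is not included in the extracted output; presumably it was drawn the same way as for the untuned model, for example:
plot(bank_tree_imp$finalModel, type = "simple")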
Confusion Matrix of Tuned Decision Tree Model
pred_tree_imp <- predict(bank_tree_imp, bank_test)

confusionMatrix(pred_tree_imp, bank_test$deposit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6760 209
## Yes 2411 986
##
## Accuracy : 0.7473
## 95% CI : (0.7388, 0.7556)
## No Information Rate : 0.8847
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3121
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7371
## Specificity : 0.8251
## Pos Pred Value : 0.9700
## Neg Pred Value : 0.2903
## Prevalence : 0.8847
## Detection Rate : 0.6521
## Detection Prevalence : 0.6723
## Balanced Accuracy : 0.7811
##
## 'Positive' Class : No
##
The Accuracy and Sensitivity of the tuned model improve, but they are still not better than the Naive Bayes model.
ROC and AUC of Tuned Decision Tree Model
The ROC curve looks good, and the AUC scores slightly better than Naive Bayes. But since the Accuracy and Sensitivity are lower, we will not use this model.
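Again, the ROC/AUC plot would come from the same helper, presumably along these lines:
visTreeROC(bank_tree_imp, bank_test, "deposit")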
Random Forest Model
Without Tuning
bank_forest <- train(deposit ~ ., data = train_bank.new, method = "rf", trControl = ctrl)
plot(bank_forest$finalModel)
legend("topright", colnames(bank_forest$finalModel$err.rate), col = 1:6, cex = 0.8, fill = 1:6)

# Confusion Matrix
pred_forest <- predict(bank_forest, bank_test)
confusionMatrix(pred_forest, bank_test$deposit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 7024 213
## Yes 2147 982
##
## Accuracy : 0.7723
## 95% CI : (0.7641, 0.7804)
## No Information Rate : 0.8847
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3449
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7659
## Specificity : 0.8218
## Pos Pred Value : 0.9706
## Neg Pred Value : 0.3138
## Prevalence : 0.8847
## Detection Rate : 0.6776
## Detection Prevalence : 0.6981
## Balanced Accuracy : 0.7938
##
## 'Positive' Class : No
##
With Tuning
Let us try to reduce ntree to 200, since based on the previous plot we can see that after 200 trees there is no significant difference in the errors.
bank_forest_imp <- train(deposit ~ ., data = train_bank.new, method = "rf", trControl = ctrl, ntree = 200)

Plotting The Model
plot(bank_forest_imp$finalModel)
legend("topright", colnames(bank_forest_imp$finalModel$err.rate), col = 1:6, cex = 0.8, fill = 1:6)

As we can see, the final model plot still gives the same reading. And the varImp plot shows duration as the most important variable, similar to the Naive Bayes and Decision Tree models.
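The varImp plot mentioned above is not shown in the extracted output; a sketch of how it could be produced with caret:
# variable importance of the tuned Random Forest model
plot(varImp(bank_forest_imp), main = "Variable Importance - Tuned Random Forest")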
Confusion Matrix of Tuned Random Forest Model
pred_forest_imp <- predict(bank_forest_imp, bank_test)

confusionMatrix(pred_forest_imp, bank_test$deposit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 7019 218
## Yes 2152 977
##
## Accuracy : 0.7714
## 95% CI : (0.7632, 0.7794)
## No Information Rate : 0.8847
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3421
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7653
## Specificity : 0.8176
## Pos Pred Value : 0.9699
## Neg Pred Value : 0.3122
## Prevalence : 0.8847
## Detection Rate : 0.6771
## Detection Prevalence : 0.6981
## Balanced Accuracy : 0.7915
##
## 'Positive' Class : No
##
The Accuracy reading is slightly lower than the previous Random Forest model, and the Sensitivity is essentially unchanged. Compared to Naive Bayes, however, the tuned Random Forest model scores better on both Accuracy and Sensitivity.
ROC and AUC of Tuned Random Forest Model
The ROC plot looks good, and the AUC score is the highest when compared to Naive Bayes (and even the Decision Tree). We will use this model to predict the validation data.
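As with the other models, the plot would presumably come from the same helper:
visTreeROC(bank_forest_imp, bank_test, "deposit")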
Ultimate Test
And so we come to the final test: we would like to see whether the selected model, Random Forest, gives consistent predictions when applied to the validation dataset.
pred_forest_valid <- predict(bank_forest_imp, bank_valid)

# Confusion Matrix
confusionMatrix(pred_forest_valid, bank_valid$deposit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 5751 176
## Yes 1822 889
##
## Accuracy : 0.7687
## 95% CI : (0.7597, 0.7776)
## No Information Rate : 0.8767
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.357
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7594
## Specificity : 0.8347
## Pos Pred Value : 0.9703
## Neg Pred Value : 0.3279
## Prevalence : 0.8767
## Detection Rate : 0.6658
## Detection Prevalence : 0.6862
## Balanced Accuracy : 0.7971
##
## 'Positive' Class : No
##
The Accuracy and Sensitivity readings are lower compared to when the model was applied to the Test dataset. Still, the decrease is not significant, so we will use this model to predict whether or not a customer will opt for the Term Deposit product.
Conclusion
- Some of the independent variables are not useful in the prediction model. Some of them are irrelevant, and some are incomplete (missing/unknown values). The result could be quite different if we had more complete values.
- The proportion of customers who opted for the Term Deposit versus those who did not is very imbalanced (around 10:90). This highly imbalanced proportion may lead to low readings in the Confusion Matrix analysis, because we only balanced the training dataset, not the testing or validation datasets.
- The most important variable is the last contact duration between the telemarketing agent and the customer. As said before, this is plausible, because customers who are willing to take the Term Deposit product tend to have longer conversations.
- The Accuracy and Sensitivity readings of all the models are not significantly different, but the one that gives the better Accuracy reading is Random Forest.
- The ROC plots of the three models, and their AUC scores, are also not significantly different. The highest AUC is given by the Random Forest model, therefore we will use Random Forest as the model for this prediction analysis.