A Decision Tree model for Bank Marketing Analysis

Summary

Bank marketing campaigns depend on customer data. These data sets are so large that it is impractical for a data analyst to extract, by hand, the information that could support the decision-making process.

Machine learning models can help improve the performance of these campaigns. This text presents a brief test of a decision tree model for analysing a marketing campaign.

The results show that the model fits the training data well, with a low training error rate (6.4%), and achieves 90.8% accuracy on the test set.

Bank marketing Dataset

This dataset was downloaded from the UCI Machine Learning Repository. It relates to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed (‘yes’) or not (‘no’).

All the features are described here: http://goo.gl/i4YyPs

The goal is to build a model that predicts whether the client will subscribe to a term deposit (y).

Exploratory analysis

Check the attributes

'data.frame':   41188 obs. of  21 variables:
 $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
 $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
 $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
 $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
 $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
 $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
 $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
 $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
 $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
 $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
 $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
 $ cons.price.idx: num  94 94 94 94 94 ...
 $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
 $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
 $ nr.employed   : num  5191 5191 5191 5191 5191 ...
 $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Dimensions of Data

dim(dfcat)

## [1] 41188    21

Summary Data

Social analysis

Credit profile

Duration of Call Feature

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   102.0   180.0   258.3   319.0  4918.0

Create Training and Test set

We will use the C5.0 algorithm in the C50 package to train our decision tree model. The C5.0 algorithm offers many ways to tailor the model to a particular learning problem.

Split the data into a training set (37069 rows, about 90%) used to fit the model and a test set (the remaining 4119 rows) used to evaluate it.

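set.seed(123)                         # fix the RNG seed so the split is reproducible
train_sample <- sample(41188, 37069)  # draw 37069 of the 41188 row indices at random
df_train <- dfcat[train_sample,]      # rows selected for training
df_test <- dfcat[-train_sample,]      # all remaining rows form the test set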
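The hard-coded counts above correspond to a 90/10 split (round(41188 * 0.9) is 37069). A minimal sketch of the same idea that derives the sizes from the data instead; `split_train_test` and the toy data frame are hypothetical names used only for illustration and are not part of the original analysis.

```r
# Hypothetical helper: random train/test split by proportion.
split_train_test <- function(df, train_frac = 0.9, seed = 123) {
  set.seed(seed)                             # reproducible split
  n_train <- round(nrow(df) * train_frac)    # number of training rows
  idx <- sample(nrow(df), n_train)           # random row indices
  list(train = df[idx, , drop = FALSE],
       test  = df[-idx, , drop = FALSE])
}

# Toy data standing in for dfcat (assumption: any data frame works here).
toy <- data.frame(x = 1:100,
                  y = factor(rep(c("no", "yes"), times = c(89, 11))))
parts <- split_train_test(toy)
nrow(parts$train)  # 90
nrow(parts$test)   # 10
```

Deriving the sizes from nrow() keeps the split valid if the data set is later filtered or extended.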

Look at the proportion of outcome categories in the training set.

prop.table(table(df_train$y))

## 
##        no       yes 
## 0.8868596 0.1131404
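These proportions are worth keeping in mind when reading accuracy figures: a trivial classifier that always answers "no" would already be right about 88.7% of the time on the training set. A small sketch of that baseline, with the proportions typed in by hand from the output above (`train_props`, `majority_class` and `majority_baseline` are hypothetical names):

```r
# Class proportions copied from the prop.table() output above.
train_props <- c(no = 0.8868596, yes = 0.1131404)

# Accuracy obtained by always predicting the most frequent class.
majority_class    <- names(which.max(train_props))
majority_baseline <- max(train_props)

majority_class     # "no"
majority_baseline  # 0.8868596
```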

Now we create a model using C5.0

library(C50)
cmodel <- C5.0(df_train[-21], df_train$y)  # column 21 is the target y, so it is dropped from the predictors
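C5.0() accepts further arguments for tailoring the model. As one illustration (not something done in this analysis), the `trials` argument requests boosting. The sketch below is an assumption-laden example: it uses the built-in iris data rather than the bank data, and it only fits a model if the C50 package is installed.

```r
# Illustrative only: boosted C5.0 on iris, not on the bank marketing data.
boosted_acc <- NA_real_
if (requireNamespace("C50", quietly = TRUE)) {
  set.seed(123)
  idx  <- sample(nrow(iris), 120)                 # 120 of 150 rows for training
  fit  <- C50::C5.0(iris[idx, -5], iris$Species[idx],
                    trials = 10)                  # up to 10 boosting iterations
  pred <- predict(fit, iris[-idx, -5])            # predict the held-out rows
  boosted_acc <- mean(pred == iris$Species[-idx]) # share of correct predictions
}
boosted_acc
```

Whether boosting helps on the bank data would have to be checked on the test set; this sketch makes no claim about that.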

Results of the trained model

Evaluation on training data (37069 cases):

    Decision Tree   
  ----------------  
  Size      Errors  

   315 2371( 6.4%)   <<


   (a)   (b)    <-classified as
  ----  ----
 31988   887    (a): class no
  1484  2710    (b): class yes

The Errors output notes that the model correctly classified all but 2371 of the 37069 training instances, for an error rate of 6.4 percent. A total of 887 actual no values were incorrectly classified as yes (false positives), while 1484 yes values were misclassified as no (false negatives).
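The arithmetic behind these figures can be checked directly from the confusion matrix. A short sketch, with the counts typed in from the summary output above (the object names are hypothetical):

```r
# Training confusion matrix copied from the C5.0 summary above
# (rows = actual class, columns = predicted class).
train_cm <- matrix(c(31988,  887,
                      1484, 2710),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual    = c("no", "yes"),
                                   predicted = c("no", "yes")))

false_pos  <- train_cm["no", "yes"]   # actual no, predicted yes
false_neg  <- train_cm["yes", "no"]   # actual yes, predicted no
n_errors   <- false_pos + false_neg
error_rate <- n_errors / sum(train_cm)

n_errors              # 2371
round(error_rate, 3)  # 0.064
```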

Top 5 attribute usage

100.00% duration
100.00% poutcome
 92.69% nr.employed
 76.71% age
 75.88% month

Evaluate model performance

To apply our decision tree to the test dataset, we use the predict() function, as shown in the following line of code:

### Evaluate model performance
cmodel_pred <- predict(cmodel, df_test)

### Cross table validation
library(gmodels)  # provides CrossTable()
CrossTable(df_test$y, cmodel_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4119 
## 
##  
##                | predicted default 
## actual default |        no |       yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             no |      3527 |       146 |      3673 | 
##                |     0.856 |     0.035 |           | 
## ---------------|-----------|-----------|-----------|
##            yes |       232 |       214 |       446 | 
##                |     0.056 |     0.052 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |      3759 |       360 |      4119 | 
## ---------------|-----------|-----------|-----------|
## 
##

Out of the 4119 test records, our model correctly predicted 3527 no and 214 yes outcomes, resulting in an accuracy of 90.8 percent and an error rate of 9.2 percent. This is good performance for this kind of model, although more analysis and a comparison with other classification models are still needed.
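The accuracy can be reproduced from the cross table, and the same counts give a few complementary measures that may be useful in the further analysis mentioned above. This is a sketch with the counts typed in by hand; precision, recall and the majority-class baseline are additions for illustration and were not reported in the original analysis.

```r
# Test-set cross table copied from the CrossTable() output above
# (rows = actual class, columns = predicted class).
test_cm <- matrix(c(3527, 146,
                     232, 214),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(actual    = c("no", "yes"),
                                  predicted = c("no", "yes")))

accuracy  <- sum(diag(test_cm)) / sum(test_cm)               # (3527 + 214) / 4119
recall    <- test_cm["yes", "yes"] / sum(test_cm["yes", ])   # 214 / 446 actual yes
precision <- test_cm["yes", "yes"] / sum(test_cm[, "yes"])   # 214 / 360 predicted yes
baseline  <- sum(test_cm["no", ]) / sum(test_cm)             # always predicting "no"

round(c(accuracy = accuracy, recall = recall,
        precision = precision, baseline = baseline), 3)
# accuracy 0.908, recall 0.480, precision 0.594, baseline 0.892
```

Because the classes are imbalanced, comparing accuracy against the always-"no" baseline and looking at recall for the yes class gives a fuller picture than accuracy alone.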

Conclusions

The results show that the model fits the training data well, with a low training error rate (6.4%), and achieves 90.8% accuracy on the test set.

References

[1] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.

[2] Max Kuhn, Steve Weston, Nathan Coulter and Mark Culp. C code for C5.0 by R. Quinlan (2015). C50: C5.0 Decision Trees and Rule-Based Models. R package version 0.1.0-24. https://CRAN.R-project.org/package=C50