Bank marketing campaigns depend on customer data. The volume of this data is so large that it is impractical for a data analyst to manually extract the information needed to support decision-making.
Machine learning models can help improve the performance of these campaigns. This text presents a brief test of a decision tree model for analysing a bank marketing campaign.
The results show that the model fits the training data well, with a low error rate (6.4%), and reaches an accuracy of 90.8% on the test set.
This dataset was downloaded from the UCI Machine Learning Repository. It relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
All the features are described here: http://goo.gl/i4YyPs
The goal is to build a model that predicts whether the client will subscribe to a term deposit (variable y).
Check the attributes
'data.frame': 41188 obs. of 21 variables:
$ age : int 56 57 37 40 56 45 59 41 24 25 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
$ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
$ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
$ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
$ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
$ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
$ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
$ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
$ duration : int 261 149 226 151 307 198 139 217 380 50 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
$ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
$ cons.price.idx: num 94 94 94 94 94 ...
$ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
$ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
$ nr.employed : num 5191 5191 5191 5191 5191 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Dimensions of Data
dim(dfcat)
## [1] 41188 21
Summary Data
Social analysis
Credit profile
Duration of Call Feature
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 102.0 180.0 258.3 319.0 4918.0
We will use the C5.0 algorithm from the C50 package to train our decision tree model. Compared to the machine learning approaches we used previously, the C5.0 algorithm offers many more ways to tailor the model to a particular learning problem.
Create training and test sets: the model is trained on the first and evaluated on the second. Here 37069 of the 41188 records (about 90%) are sampled at random for training, and the remaining 4119 form the test set.
set.seed(123)
train_sample <- sample(41188, 37069)
df_train <- dfcat[train_sample,]
df_test <- dfcat[-train_sample,]
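As a quick check on the sizes used above, the following sketch reproduces the split arithmetic without needing the dataset. The 90% training fraction is an assumption inferred from 37069 / 41188, and the names n_total, train_frac, n_train, n_test and idx are illustrative rather than part of the original analysis.

```r
# Sketch: reproduce the train/test split sizes (no dataset required).
n_total    <- 41188   # number of rows reported by dim(dfcat)
train_frac <- 0.9     # assumed fraction; 0.9 * 41188 = 37069.2

n_train <- floor(train_frac * n_total)   # rows used for training
n_test  <- n_total - n_train             # rows held out for testing

set.seed(123)
idx <- sample(n_total, n_train)          # distinct row indices for training

cat("train:", n_train, "test:", n_test, "\n")
```

Deriving the sample size from nrow(dfcat) in this way, instead of hard-coding 41188 and 37069, would keep the split valid if the number of rows ever changed.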
Look at the proportion of outcome categories in the training set.
prop.table(table(df_train$y))
##
## no yes
## 0.8868596 0.1131404
Now we create a model using C5.0, excluding column 21 (the target y) from the predictors.
library(C50)
cmodel <- C5.0(df_train[-21], df_train$y)
Results of the trained model
Evaluation on training data (37069 cases):
Decision Tree
----------------
Size Errors
315 2371( 6.4%) <<
(a) (b) <-classified as
---- ----
31988 887 (a): class no
1484 2710 (b): class yes
The Errors output notes that the model correctly classified all but 2371 of the 37069 training instances, for an error rate of 6.4 percent. A total of 887 actual no values were incorrectly classified as yes (false positives), while 1484 actual yes values were misclassified as no (false negatives).
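The error rate quoted above can be recomputed from the four cells of the training confusion matrix. The sketch below enters those counts by hand (the object name train_cm and the other variable names are illustrative) and derives the number of errors and the error rate.

```r
# Training confusion matrix copied from the C5.0 summary.
# Rows are actual classes, columns are predicted classes.
train_cm <- matrix(c(31988,  887,
                      1484, 2710),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual    = c("no", "yes"),
                                   predicted = c("no", "yes")))

false_pos  <- train_cm["no", "yes"]   # actual no, predicted yes
false_neg  <- train_cm["yes", "no"]   # actual yes, predicted no
n_cases    <- sum(train_cm)
n_errors   <- false_pos + false_neg
error_rate <- n_errors / n_cases

cat(sprintf("errors: %.0f of %.0f (%.1f%%)\n",
            n_errors, n_cases, 100 * error_rate))
```

Most of the training errors are false negatives (1484 against 887), so on the training data the tree is more likely to miss a subscriber than to predict one wrongly.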
Top 5 attribute usage
100.00% duration
100.00% poutcome
92.69% nr.employed
76.71% age
75.88% month
To apply our decision tree to the test dataset, we use the predict() function, as shown in the following line of code:
### Evaluate model performance
cmodel_pred <- predict(cmodel, df_test)
### Cross table validation
library(gmodels)
CrossTable(df_test$y, cmodel_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4119
##
##
## | predicted default
## actual default | no | yes | Row Total |
## ---------------|-----------|-----------|-----------|
## no | 3527 | 146 | 3673 |
## | 0.856 | 0.035 | |
## ---------------|-----------|-----------|-----------|
## yes | 232 | 214 | 446 |
## | 0.056 | 0.052 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 3759 | 360 | 4119 |
## ---------------|-----------|-----------|-----------|
##
##
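Of the 4119 test records, the model correctly predicted 3527 'no' outcomes and 214 'yes' outcomes, for an accuracy of 90.8 percent and an error rate of 9.2 percent (the 'default' labels in the table refer to the subscription outcome y). This looks like good performance, but about 89 percent of the test records are 'no', so accuracy alone is flattering: the model identified only 214 of the 446 actual subscribers (48 percent). Further analysis and comparison with other classification models are needed.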
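The figures in the cross table can be recomputed directly from its four counts. The sketch below does so and also derives three quantities the table does not report: the accuracy of always predicting 'no', and the recall and precision for the 'yes' class. The choice of these extra metrics and all variable names are illustrative, not part of the original analysis.

```r
# Test-set confusion matrix copied from the CrossTable output.
# Rows are actual classes, columns are predicted classes.
test_cm <- matrix(c(3527, 146,
                     232, 214),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(actual    = c("no", "yes"),
                                  predicted = c("no", "yes")))

n_test        <- sum(test_cm)
accuracy      <- sum(diag(test_cm)) / n_test        # (3527 + 214) / 4119
error_rate    <- 1 - accuracy
baseline      <- max(rowSums(test_cm)) / n_test     # always predict the majority class
recall_yes    <- test_cm["yes", "yes"] / sum(test_cm["yes", ])   # 214 / 446
precision_yes <- test_cm["yes", "yes"] / sum(test_cm[, "yes"])   # 214 / 360

print(round(c(accuracy = accuracy, error_rate = error_rate,
              baseline = baseline, recall_yes = recall_yes,
              precision_yes = precision_yes), 3))
```

Comparing accuracy with the majority-class baseline is a simple way to judge a classifier when the classes are as imbalanced as they are here.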
The results show that the model fits the training data well, with a low error rate (6.4%), and that its accuracy on the test set is 90.8%.
[1] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.
[2] Max Kuhn, Steve Weston, Nathan Coulter and Mark Culp. C code for C5.0 by R. Quinlan (2015). C50: C5.0 Decision Trees and Rule-Based Models. R package version 0.1.0-24. https://CRAN.R-project.org/package=C50