DEFINE PROBLEM STATEMENT

Will the client subscribe to the term deposit offered by the Portuguese banking institution?

DATASET INFORMATION

The data relate to direct marketing campaigns for a bank term deposit at a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (i.e., the bank term deposit) would be subscribed to or not.

BANK MARKETING (DISPLAYING ONLY 12 OBSERVATIONS)
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no
35 management married tertiary no 231 yes no unknown 5 may 139 1 -1 0 unknown no
28 management single tertiary no 447 yes yes unknown 5 may 217 1 -1 0 unknown no
42 entrepreneur divorced tertiary yes 2 yes no unknown 5 may 380 1 -1 0 unknown no
58 retired married primary no 121 yes no unknown 5 may 50 1 -1 0 unknown no
43 technician single secondary no 593 yes no unknown 5 may 55 1 -1 0 unknown no
41 admin. divorced secondary no 270 yes no unknown 5 may 222 1 -1 0 unknown no
29 admin. single secondary no 390 yes no unknown 5 may 137 1 -1 0 unknown no

ATTRIBUTES INFORMATION

# bank client data:

age : age of the client (numeric)

job : type of a job (categorical)(“admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”,“blue-collar”,“self-employed”,“retired”,“technician”,“services”)

marital : marital status (categorical)(“married”,“divorced”,“single”)

education : (categorical)(“unknown”,“secondary”,“primary”,“tertiary”)

default: has credit in default? (binary: “yes”,“no”)

balance: average yearly balance in euros (numeric)

housing: has housing loan? (binary: “yes”,“no”)

loan: has personal loan? (binary: “yes”,“no”)

# related with the last contact of the current campaign:

contact: contact communication type (categorical)(“unknown”,“telephone”,“cellular”)

day: last contact day of the month (numeric)

month: last contact month of year (categorical)(“jan”, “feb”, “mar”, …, “nov”, “dec”)

duration: last contact duration, in seconds (numeric)

# other attributes:

campaign: number of contacts performed during this campaign for this client (numeric; includes the last contact)

pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric)(-1 means client was not previously contacted)

previous: number of contacts performed before this campaign and for this client (numeric)

poutcome: outcome of the previous marketing campaign (categorical)(“unknown”,“other”,“failure”,“success”)

y : has the client subscribed to a term deposit? (binary: “yes”,“no”)

Here we have 45211 observations of 17 variables: y is the outcome variable with two classes, and the remaining 16 variables are the predictors.

summary(bmarketing)
      age                 job           marital          education    
 Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
 1st Qu.:33.00   management :9458   married :27214   secondary:23202  
 Median :39.00   technician :7597   single  :12790   tertiary :13301  
 Mean   :40.94   admin.     :5171                    unknown  : 1857  
 3rd Qu.:48.00   services   :4154                                     
 Max.   :95.00   retired    :2264                                     
                 (Other)    :6835                                     
 default        balance       housing      loan            contact     
 no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
 yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
             Median :   448                           unknown  :13020  
             Mean   :  1362                                            
             3rd Qu.:  1428                                            
             Max.   :102127                                            
                                                                       
      day            month          duration         campaign     
 Min.   : 1.00   may    :13766   Min.   :   0.0   Min.   : 1.000  
 1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
 Median :16.00   aug    : 6247   Median : 180.0   Median : 2.000  
 Mean   :15.81   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
 3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
 Max.   :31.00   apr    : 2932   Max.   :4918.0   Max.   :63.000  
                 (Other): 6060                                    
     pdays          previous           poutcome       y        
 Min.   : -1.0   Min.   :  0.0000   failure: 4901   no :39922  
 1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   yes: 5289  
 Median : -1.0   Median :  0.0000   success: 1511              
 Mean   : 40.2   Mean   :  0.5803   unknown:36959              
 3rd Qu.: -1.0   3rd Qu.:  0.0000                              
 Max.   :871.0   Max.   :275.0000                              
                                                               
bmarketing$age<-as.numeric(bmarketing$age)
bmarketing$balance<-as.numeric(bmarketing$balance)
bmarketing$day<-as.numeric(bmarketing$day)
bmarketing$duration<-as.numeric(bmarketing$duration)
bmarketing$campaign<-as.numeric(bmarketing$campaign)
bmarketing$pdays<-as.numeric(bmarketing$pdays)
bmarketing$previous<-as.numeric(bmarketing$previous)
str(bmarketing)
'data.frame':   45211 obs. of  17 variables:
 $ age      : num  58 44 33 47 33 35 28 42 58 43 ...
 $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
 $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
 $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
 $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
 $ balance  : num  2143 29 2 1506 1 ...
 $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
 $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
 $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ day      : num  5 5 5 5 5 5 5 5 5 5 ...
 $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ duration : num  261 151 76 92 198 139 217 380 50 55 ...
 $ campaign : num  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays    : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ previous : num  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

DATA PREPROCESSING

Check For Missing Values

sum(is.na(bmarketing))
[1] 0
The dataset contains no missing (NA) values.
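
Although is.na() finds nothing, the attribute list above shows that several categorical variables use the level "unknown" as a placeholder. The following is a minimal sketch for counting such entries per column. The data frame `toy` is made up for illustration and is not the bank data.

```r
# Hypothetical toy data mimicking a few categorical columns of the bank data
toy <- data.frame(
  job       = c("management", "unknown", "technician", "admin."),
  education = c("tertiary", "unknown", "secondary", "unknown"),
  contact   = c("unknown", "unknown", "cellular", "unknown"),
  stringsAsFactors = TRUE
)

# No NA values, just as in the check above
n_na <- sum(is.na(toy))

# Count the "unknown" placeholders in each column
n_unknown <- sapply(toy, function(col) sum(as.character(col) == "unknown"))

print(n_na)
print(n_unknown)
```

Whether "unknown" should be treated as missing or kept as its own category is a modelling decision; the analysis here keeps it as a category.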

Check For Outliers

boxplot(bmarketing)

The boxplot shows outliers in several variables. The variables 'balance', 'duration' and 'previous' contain some extreme outliers, which are removed because they can affect model performance.

Remove extreme outliers

dim(bmarketing)
[1] 45211    17
bmarketing<-bmarketing[!bmarketing$balance>5e+04,]
bmarketing<-bmarketing[!bmarketing$balance==-8019,]
bmarketing<-bmarketing[!bmarketing$balance==-6847,]
bmarketing<-bmarketing[!bmarketing$duration>3500,]
bmarketing<-bmarketing[!bmarketing$previous>50,]
boxplot(bmarketing)

dim(bmarketing)
[1] 45184    17
After omitting these extreme outliers, 45184 observations remain (27 rows removed). These are split further into train, test1 and test2 data.
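
The five filtering steps above can also be expressed as one logical condition. This is a minimal sketch on made-up values (`toy` is hypothetical); the thresholds are the ones used above.

```r
# Hypothetical rows: two ordinary clients and five that hit one filter each
toy <- data.frame(
  balance  = c(100, 60000, -8019, -6847, 2500, 300, 1506),
  duration = c(200,   150,    90,    80, 4000, 120,   92),
  previous = c(  0,     1,     0,     2,    0, 275,    0)
)

# Keep a row only if it passes every outlier rule
keep <- with(toy,
  balance <= 5e4 &
  !(balance %in% c(-8019, -6847)) &
  duration <= 3500 &
  previous <= 50
)

cleaned <- toy[keep, ]
print(dim(cleaned))
```

Collecting the rules in a single `keep` vector also makes it easy to count how many rows each rule removes before the data are overwritten.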

RESAMPLING THE PRE-PROCESSED DATA

library(caret)
split<-createDataPartition(bmarketing$y,p=0.6,list = FALSE)
train<-bmarketing[split,]
test<-bmarketing[-split,]
split<-createDataPartition(test$y,p=0.5,list=FALSE)
test1<-test[split,]
test2<-test[-split,]
dim(train)
[1] 27112    17
dim(test1)
[1] 9036   17
dim(test2)
[1] 9036   17
Here the 45184 observations are split into 27112 observations for training and 9036 observations each for test1 and test2.
Now let's compare models on these data: each model is built on the train data, the best models are chosen from those results, and they are then tested on the test1 and test2 data.
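
For a factor outcome, createDataPartition() samples within each class so that the class proportions are preserved. The sketch below imitates that idea in base R for a 60/20/20 split. The helper `stratified_idx`, the toy outcome `y` and the seed are assumptions made for illustration; the original run sets no seed, so its exact split cannot be reproduced.

```r
set.seed(42)  # arbitrary seed (assumption), so the sketch is repeatable

# Hypothetical outcome with roughly the class balance of the bank data
y <- factor(rep(c("no", "yes"), times = c(880, 120)))

# Draw a fraction p of the positions within each class of y
stratified_idx <- function(y, p) {
  per_class <- lapply(split(seq_along(y), y), function(i) {
    i[sample.int(length(i), round(p * length(i)))]
  })
  sort(unname(unlist(per_class)))
}

train_idx <- stratified_idx(y, 0.6)
rest_idx  <- setdiff(seq_along(y), train_idx)
test1_idx <- rest_idx[stratified_idx(y[rest_idx], 0.5)]
test2_idx <- setdiff(rest_idx, test1_idx)

# Share of "yes" in each part stays the same
sapply(list(train = train_idx, test1 = test1_idx, test2 = test2_idx),
       function(i) mean(y[i] == "yes"))
```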

BINARY LOGISTIC REGRESSION

MODEL BUILDING

library(caret)
kfoldrepeated<-trainControl(method = "repeatedcv",number = 5,repeats = 3)
model_bin<-train(y ~ .,data=train,method="glm",trControl=kfoldrepeated)
model_bin
Generalized Linear Model 

27112 samples
   16 predictor
    2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 21689, 21690, 21689, 21690, 21690, 21689, ... 
Resampling results:

  Accuracy   Kappa    
  0.8991959  0.3816199
Here the model is evaluated with 5-fold cross-validation repeated 3 times (15 resamples in total); the accuracy and kappa shown are averages over those resamples.
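
To illustrate where the sample sizes 21689 and 21690 in the output come from, the sketch below generates repeated k-fold assignments in base R and counts the rows left for fitting in each resample. It is a simplified stand-in: the fold assignment here is purely random, whereas caret also balances the classes across folds. The seed is an arbitrary assumption.

```r
set.seed(1)  # arbitrary seed (assumption)
n <- 27112   # rows in the train data
k <- 5       # folds
repeats <- 3

# One random fold assignment per repeat; each fold is held out once
train_sizes <- unlist(lapply(seq_len(repeats), function(r) {
  fold <- sample(rep(seq_len(k), length.out = n))
  sapply(seq_len(k), function(f) sum(fold != f))
}))

length(train_sizes)  # number of resamples: k * repeats
range(train_sizes)   # rows used for fitting in each resample
```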

PREDICTING ON TRAIN DATA

pred_train<-predict(model_bin,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  23351   591
       yes  2134  1036
                                         
               Accuracy : 0.8995         
                 95% CI : (0.8959, 0.903)
    No Information Rate : 0.94           
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.383          
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.9163         
            Specificity : 0.6368         
         Pos Pred Value : 0.9753         
         Neg Pred Value : 0.3268         
             Prevalence : 0.9400         
         Detection Rate : 0.8613         
   Detection Prevalence : 0.8831         
      Balanced Accuracy : 0.7765         
                                         
       'Positive' Class : no             
                                         
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  27112 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     23351 |       591 |     23942 | 
                                 |     0.861 |     0.022 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |      2134 |      1036 |      3170 | 
                                 |     0.079 |     0.038 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     25485 |      1627 |     27112 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6510646
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

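The model reaches about 90% accuracy on the train data (86% of all cases are correctly predicted non-subscribers and 4% are correctly predicted subscribers), but kappa is low (0.38) and the area under the curve is only 0.65. Because confusionMatrix() is called here with the actual classes first and the predictions second, the rows of the table are the actual classes, and the values labelled sensitivity and specificity (0.92 and 0.64) are computed per predicted class rather than per actual class.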
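
caret's confusionMatrix(data, reference) expects the predictions as `data` and the true classes as `reference`. As a cross-check, the sketch below recomputes the main metrics directly from the counts in the cross table above, taking the actual class as the reference and "no" as the positive class. The variable names are illustrative only.

```r
# Counts from the cross table above: rows = actual class, columns = predicted class
tab <- matrix(c(23351,  591,
                 2134, 1036),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("no", "yes"), predicted = c("no", "yes")))

n        <- sum(tab)
accuracy <- sum(diag(tab)) / n

# Cohen's kappa: agreement beyond what the marginal totals give by chance
p_chance  <- sum(rowSums(tab) * colSums(tab)) / n^2
kappa_hat <- (accuracy - p_chance) / (1 - p_chance)

# Rates per actual class, with "no" as the positive class
sensitivity <- tab["no", "no"]   / sum(tab["no", ])   # actual "no" predicted as "no"
specificity <- tab["yes", "yes"] / sum(tab["yes", ])  # actual "yes" predicted as "yes"

round(c(accuracy = accuracy, kappa = kappa_hat,
        sensitivity = sensitivity, specificity = specificity), 4)
```

Accuracy and kappa are symmetric in the two arguments, so they agree with the output above. The per-actual-class rates computed here correspond to the values printed above as Pos Pred Value and Neg Pred Value.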

NAIVE BAYES CLASSIFIER

MODEL BUILDING

library(naivebayes)
model_nb<-naive_bayes(y ~ ., data=train)
model_nb$prior

       no       yes 
0.8830776 0.1169224 
The model first computes the prior class probabilities (88% not subscribing, 12% subscribing). As a probabilistic model, it then combines these priors with the per-feature likelihoods to obtain posterior probabilities.
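
The priors are simply the class proportions in the train data. The sketch below reproduces them from the class counts and then applies Bayes' rule for one feature value. The two likelihood values are made-up numbers for illustration, not quantities taken from model_nb.

```r
# Class counts of the train data (row totals of the cross table shown earlier)
y <- factor(rep(c("no", "yes"), times = c(23942, 3170)))

# Prior probabilities = relative class frequencies
prior <- prop.table(table(y))
print(prior)

# Bayes' rule for a single hypothetical feature value x:
# posterior is proportional to P(x | class) * P(class)
likelihood   <- c(no = 0.10, yes = 0.40)      # P(x | class), made-up values
unnormalised <- likelihood * as.numeric(prior)
posterior    <- unnormalised / sum(unnormalised)
print(round(posterior, 3))
```

With several predictors, naive Bayes multiplies one such likelihood per feature, assuming the features are conditionally independent given the class.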

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_nb,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  22007  1935
       yes  1525  1645
                                          
               Accuracy : 0.8724          
                 95% CI : (0.8684, 0.8763)
    No Information Rate : 0.868           
    P-Value [Acc > NIR] : 0.01571         
                                          
                  Kappa : 0.4148          
 Mcnemar's Test P-Value : 3.571e-12       
                                          
            Sensitivity : 0.9352          
            Specificity : 0.4595          
         Pos Pred Value : 0.9192          
         Neg Pred Value : 0.5189          
             Prevalence : 0.8680          
         Detection Rate : 0.8117          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.6973          
                                          
       'Positive' Class : no              
                                          
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  27112 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     22007 |      1935 |     23942 | 
                                 |     0.812 |     0.071 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |      1525 |      1645 |      3170 | 
                                 |     0.056 |     0.061 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     23532 |      3580 |     27112 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7190536
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

The model has 87% accuracy, but kappa (0.41) and the reported specificity (0.46) are low.
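
The AUC values in this report are computed from predicted class labels rather than predicted probabilities. With a score that takes only two values, the ROC curve has a single interior point and its area equals the average of the true positive and true negative rates. The sketch below checks this on the naive Bayes counts above with a rank-based AUC; the helper `auc_rank` is written here for illustration and is not part of ROCR.

```r
# Rebuild label vectors from the naive Bayes cross table (rows = actual)
counts    <- c(22007, 1935, 1525, 1645)
actual    <- rep(c("no", "no", "yes", "yes"), times = counts)
predicted <- rep(c("no", "yes", "no", "yes"), times = counts)

# Rank-based (Mann-Whitney) AUC; tied scores count as one half
auc_rank <- function(score, is_pos) {
  r     <- rank(score)
  n_pos <- as.numeric(sum(is_pos))
  n_neg <- as.numeric(sum(!is_pos))
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_labels <- auc_rank(as.numeric(predicted == "yes"), actual == "yes")

# Average of the two per-class hit rates
tpr <- 1645 / (1525 + 1645)    # actual "yes" predicted "yes"
tnr <- 22007 / (22007 + 1935)  # actual "no" predicted "no"

c(auc_from_labels = auc_labels, mean_of_rates = (tpr + tnr) / 2)
```

To obtain a ROC curve with more than one threshold, class probabilities would be passed as the score instead, for example from predict(model_nb, train, type = "prob"), assuming the fitted model supports that prediction type.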

CART DECISION TREE

MODEL BUILDING

library(rpart)
model_cart<-rpart(y ~ ., data=train,control = rpart.control(minbucket = 10))
model_cart$cptable
          CP nsplit rel error    xerror       xstd
1 0.03417455      0 1.0000000 1.0000000 0.01669052
2 0.02839117      3 0.8974763 0.9176656 0.01607557
3 0.02050473      4 0.8690852 0.8772871 0.01575943
4 0.01000000      5 0.8485804 0.8577287 0.01560261
library(rpart.plot)
rpart.plot(model_cart)

Here rpart evaluates a sequence of complexity parameters (cp) and reports the cross-validated error (xerror) for each. The tree is grown down to the default cp = 0.01, which also has the lowest cross-validated error (xerror = 0.858).
From the tree plot, the variable duration is selected as a split node because it gives the lowest Gini index.
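
The choice of cp can be read off the cptable programmatically. The sketch below copies the table printed above into a matrix and picks the row with the lowest xerror. It also applies the one-standard-error rule, a common convention that is an addition here and not something the analysis above uses.

```r
# cptable values copied from the output above
cptable <- matrix(
  c(0.03417455, 0, 1.0000000, 1.0000000, 0.01669052,
    0.02839117, 3, 0.8974763, 0.9176656, 0.01607557,
    0.02050473, 4, 0.8690852, 0.8772871, 0.01575943,
    0.01000000, 5, 0.8485804, 0.8577287, 0.01560261),
  ncol = 5, byrow = TRUE,
  dimnames = list(1:4, c("CP", "nsplit", "rel error", "xerror", "xstd"))
)

# Row with the lowest cross-validated error
best    <- which.min(cptable[, "xerror"])
best_cp <- cptable[best, "CP"]

# One-standard-error rule: simplest tree whose xerror is within one xstd of the best
threshold <- cptable[best, "xerror"] + cptable[best, "xstd"]
cp_1se    <- cptable[which(cptable[, "xerror"] <= threshold)[1], "CP"]

c(best_cp = unname(best_cp), cp_1se = unname(cp_1se))
```

On the fitted model the same lookup would use model_cart$cptable directly, and the chosen value could then be passed to rpart's prune() function.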

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_cart,train,type="class")
confusionMatrix(as.factor(train$y),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  23327   615
       yes  2075  1095
                                          
               Accuracy : 0.9008          
                 95% CI : (0.8972, 0.9043)
    No Information Rate : 0.9369          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3996          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9183          
            Specificity : 0.6404          
         Pos Pred Value : 0.9743          
         Neg Pred Value : 0.3454          
             Prevalence : 0.9369          
         Detection Rate : 0.8604          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.7793          
                                          
       'Positive' Class : no              
                                          
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  27112 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     23327 |       615 |     23942 | 
                                 |     0.860 |     0.023 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |      2075 |      1095 |      3170 | 
                                 |     0.077 |     0.040 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     25402 |      1710 |     27112 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6598694
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

The CART decision tree performs well with 90% accuracy, but kappa is low (0.40), the reported specificity is modest (0.64), and the area under the curve is 0.66.

C50 DECISION TREE

MODEL BUILDING

library(C50)
model_c50<-C5.0(y ~ ., data=train)

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_c50,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  23405   537
       yes  1311  1859
                                          
               Accuracy : 0.9318          
                 95% CI : (0.9288, 0.9348)
    No Information Rate : 0.9116          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6308          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9470          
            Specificity : 0.7759          
         Pos Pred Value : 0.9776          
         Neg Pred Value : 0.5864          
             Prevalence : 0.9116          
         Detection Rate : 0.8633          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.8614          
                                          
       'Positive' Class : no              
                                          
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  27112 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     23405 |       537 |     23942 | 
                                 |     0.863 |     0.020 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |      1311 |      1859 |      3170 | 
                                 |     0.048 |     0.069 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     24716 |      2396 |     27112 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7820031
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

The model reaches 93% accuracy with a good kappa (0.63), and the other metrics are also strong, so it is a good model.

RANDOM FOREST

MODEL BUILDING

library(randomForest)
randomForest 4.6-14
model_rf<-randomForest(y ~ ., data=train,mtry=3)
model_rf

Call:
 randomForest(formula = y ~ ., data = train, mtry = 3) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 9.56%
Confusion matrix:
       no  yes class.error
no  23186  756  0.03157631
yes  1835 1335  0.57886435
plot(model_rf)

varImpPlot(model_rf)

The random forest has an out-of-bag (OOB) error of 9.56%. The OOB error acts like a test error: each tree is evaluated on the train observations left out of its bootstrap sample (on average about 37% of the train data).
From the model plot we can see that the error rate decreases as the number of trees increases. In the variable importance plot, variables with a larger mean decrease in Gini are more important.
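
The share of observations that end up out of bag can be checked by simulation. The sketch below draws bootstrap samples of the same size as the train data and measures the fraction of rows never drawn; it assumes sampling with replacement at full sample size, and the seed is arbitrary.

```r
set.seed(123)  # arbitrary seed (assumption)
n <- 27112     # rows in the train data

# Fraction of rows that a single bootstrap sample never draws
oob_fraction <- replicate(20, {
  drawn <- sample.int(n, n, replace = TRUE)
  1 - length(unique(drawn)) / n
})

mean(oob_fraction)  # simulated out-of-bag share
exp(-1)             # limit of (1 - 1/n)^n for large n
```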

PREDICT ON TRAIN DATA

pred_train<-predict(model_rf,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  23942     0
       yes   128  3042
                                          
               Accuracy : 0.9953          
                 95% CI : (0.9944, 0.9961)
    No Information Rate : 0.8878          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9767          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9947          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9596          
             Prevalence : 0.8878          
         Detection Rate : 0.8831          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.9973          
                                          
       'Positive' Class : no              
                                          
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  27112 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     23942 |         0 |     23942 | 
                                 |     0.883 |     0.000 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       128 |      3042 |      3170 | 
                                 |     0.005 |     0.112 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     24070 |      3042 |     27112 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9798107
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

The model performs excellently on the train data, with 99.5% accuracy and very high values on the other metrics. This is far better than its 9.56% out-of-bag error suggests, so we need to check whether the performance holds up on other subsets of the bmarketing dataset.
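
The gap between the two random forest estimates can be quantified from the two confusion matrices printed above. The sketch below is a small illustration using those counts; the helper `acc` is defined here only for this purpose.

```r
# OOB confusion matrix printed by model_rf (rows = actual class)
oob <- matrix(c(23186,  756,
                 1835, 1335),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("no", "yes"), predicted = c("no", "yes")))

# Confusion matrix from predicting on the train data itself
resub <- matrix(c(23942,    0,
                    128, 3042),
                nrow = 2, byrow = TRUE, dimnames = dimnames(oob))

# Accuracy = share of counts on the diagonal
acc <- function(tab) sum(diag(tab)) / sum(tab)

round(c(oob = acc(oob),
        resubstitution = acc(resub),
        gap = acc(resub) - acc(oob)), 4)
```

A gap of this size is typical for random forests, whose trees can nearly memorise the rows they were grown on, which is why the held-out data are the better guide to real performance.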

ADAPTIVE BOOSTING

MODEL BUILDING

library(ada)
model_ada<-ada(y ~ ., data=train,loss='exponential',type='discrete',iter=100)
model_ada
Call:
ada(y ~ ., data = train, loss = "exponential", type = "discrete", 
    iter = 100)

Loss: exponential Method: discrete   Iteration: 100 

Final Confusion Matrix for Data:
          Final Prediction
True value    no   yes
       no  23281   661
       yes  1792  1378

Train Error: 0.09 

Out-Of-Bag Error:  0.091  iteration= 88 

Additional Estimates of number of iterations:

train.err1 train.kap1 
        95         95 
plot(model_ada)

Here discrete AdaBoost with exponential loss is run for 100 iterations. Both the train error and the out-of-bag error are about 9%, and the plot shows the error rate decreasing as the number of iterations increases.
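
To make the boosting idea concrete, the sketch below performs one round of the textbook discrete AdaBoost weight update on five made-up observations. It illustrates the general algorithm only and is not a description of the internals of the ada package; all values are hypothetical.

```r
# Toy labels and the predictions of one weak learner, coded as -1 / +1
y    <- c(1,  1, -1, -1, 1)
pred <- c(1, -1, -1, -1, 1)   # the second observation is misclassified

# Start from uniform observation weights
w <- rep(1 / length(y), length(y))

# Weighted error of the weak learner and its vote (alpha)
err   <- sum(w * (pred != y))
alpha <- 0.5 * log((1 - err) / err)

# Exponential-loss update: misclassified points gain weight, the rest lose it
w_new <- w * exp(-alpha * y * pred)
w_new <- w_new / sum(w_new)

round(c(err = err, alpha = alpha), 4)
round(w_new, 4)
```

After the update the misclassified observation carries half of the total weight, so the next weak learner is pushed to get it right.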

PREDICT ON TRAIN DATA

pred_train<-predict(model_ada,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  23281   661
       yes  1792  1378
                                         
               Accuracy : 0.9095         
                 95% CI : (0.906, 0.9129)
    No Information Rate : 0.9248         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.4816         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.9285         
            Specificity : 0.6758         
         Pos Pred Value : 0.9724         
         Neg Pred Value : 0.4347         
             Prevalence : 0.9248         
         Detection Rate : 0.8587         
   Detection Prevalence : 0.8831         
      Balanced Accuracy : 0.8022         
                                         
       'Positive' Class : no             
                                         
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  27112 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     23281 |       661 |     23942 | 
                                 |     0.859 |     0.024 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |      1792 |      1378 |      3170 | 
                                 |     0.066 |     0.051 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     25073 |      2039 |     27112 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.703546
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

The model performs well with 91% accuracy, but kappa is fairly low (0.48); the remaining metrics are satisfactory.

Now let's check the summary of the performance of all the above models built and evaluated on the train data.

SUMMARY

SUMMARY OF VARIOUS MODELS ON TRAIN DATA (metrics in %; correct predictions out of 27112 values)
MODEL                       ACCURACY  KAPPA  SENSITIVITY  SPECIFICITY  BALANCED_ACCURACY  AREA_UNDER_CURVE  NO_OF_VALUES_PREDICTED_CORRECTLY
BINARY_LOGISTIC_REGRESSION  90        40.5   91.8         65           78                 66                24452
NAIVE_BAYES                 87        41.8   93.7         45.5         69.6               72.5              23607
CART_DECISION_TREE          90        43.4   92.4         62.4         77.4               68.3              24444
C50_DECISION_TREE           93        62.6   94.6         76.9         85.7               78                25235
RANDOM_FOREST               99        97     99           100          99                 97.7              26970
ADAPTIVE_BOOSTING           91        49     93           67           80                 71                24690

From the summary, the random forest performs best of all the models, followed by the C5.0 decision tree; the other models are only satisfactory because their kappa is rather low. So let's shortlist the random forest and C5.0 decision tree models and check whether their performance is consistent on the test1 and test2 data, to guard against overfitting (the bias-variance trade-off).

USING RANDOM FOREST

PREDICT ON TEST1 DATA

pred_test1<-predict(model_rf,test1)
confusionMatrix(as.factor(test1$y),as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  7746  234
       yes  617  439
                                          
               Accuracy : 0.9058          
                 95% CI : (0.8996, 0.9118)
    No Information Rate : 0.9255          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.4585          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9262          
            Specificity : 0.6523          
         Pos Pred Value : 0.9707          
         Neg Pred Value : 0.4157          
             Prevalence : 0.9255          
         Detection Rate : 0.8572          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.7893          
                                          
       'Positive' Class : no              
                                          
library(gmodels)
CrossTable(test1$y, pred_test1,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  9036 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      7746 |       234 |      7980 | 
                                 |     0.857 |     0.026 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       617 |       439 |      1056 | 
                                 |     0.068 |     0.049 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |      8363 |       673 |      9036 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
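# Note: hard class labels (not class probabilities) are passed as scores, so the
# ROC curve has a single operating point.
pr <- prediction(as.numeric(pred_test1), as.numeric(test1$y))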
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6931982
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
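
Because the scores passed to the ROC computation are the predicted class labels, the resulting AUC is the average of the two per-class recall rates. The sketch below illustrates this in base R with a rank-based (Mann-Whitney) AUC; `auc_rank` is a hypothetical helper, the label vectors are rebuilt from the TEST1 cell counts above, and the continuous `score` is simulated purely for illustration.

```r
# Mann-Whitney (rank-based) AUC; tied scores receive average ranks.
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Rebuild the TEST1 labels and hard predictions from the cell counts above
# (1 = "yes", 0 = "no").
actual    <- c(rep(0, 7980), rep(1, 1056))
hard_pred <- c(rep(0, 7746), rep(1, 234), rep(0, 617), rep(1, 439))

auc_hard <- auc_rank(hard_pred, actual)
balanced_accuracy <- (7746 / 7980 + 439 / 1056) / 2
c(auc_hard = auc_hard, balanced_accuracy = balanced_accuracy)

# Simulated continuous scores (an assumption, not model output), only to show
# that graded scores give the ROC curve more than one operating point.
set.seed(1)
score <- ifelse(actual == 1,
                rnorm(length(actual), mean = 1),
                rnorm(length(actual), mean = 0))
auc_score <- auc_rank(score, actual)
```

In the analysis itself, graded scores could likely be obtained by requesting class probabilities, for example `predict(model, newdata, type = "prob")`, assuming the fitted object supports that option.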

PREDICT ON TEST2 DATA

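pred_test2 <- predict(model_rf, test2)
# Actual labels are passed first, so rows below are actual classes and columns are predictions.
confusionMatrix(as.factor(test2$y), as.factor(pred_test2))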
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  7774  206
       yes  552  504
                                          
               Accuracy : 0.9161          
                 95% CI : (0.9102, 0.9217)
    No Information Rate : 0.9214          
    P-Value [Acc > NIR] : 0.9701          
                                          
                  Kappa : 0.5263          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9337          
            Specificity : 0.7099          
         Pos Pred Value : 0.9742          
         Neg Pred Value : 0.4773          
             Prevalence : 0.9214          
         Detection Rate : 0.8603          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.8218          
                                          
       'Positive' Class : no              
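
The "No Information Rate" and its p-value compare accuracy against always guessing the largest class. The base-R sketch below re-computes that comparison from the TEST2 counts above, assuming the reported rate is the largest column share of the table (0.9214 = 8326/9036 in this output); it also shows the same test against the largest share of the actual classes (7980/9036). The variable names are hypothetical.

```r
# Counts copied from the random forest TEST2 table above
# (rows = actual class, columns = predicted class).
cm <- matrix(c(7774, 206,
               552, 504),
             nrow = 2, byrow = TRUE)
correct <- sum(diag(cm))
n <- sum(cm)

# Baseline as it appears in the output above: largest column share.
nir_reported <- max(colSums(cm)) / n
# Baseline from the actual class distribution: largest row share.
nir_actual <- max(rowSums(cm)) / n

# One-sided binomial test: is accuracy greater than the baseline?
p_reported <- binom.test(correct, n, p = nir_reported,
                         alternative = "greater")$p.value
p_actual <- binom.test(correct, n, p = nir_actual,
                       alternative = "greater")$p.value

round(c(nir_reported = nir_reported, nir_actual = nir_actual), 4)
c(p_reported = p_reported, p_actual = p_actual)
```

Against the share of actual "no" cases (about 0.883), an accuracy of about 0.916 is a clear improvement, whereas against the share of predicted "no" cases it is not.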
                                          
library(gmodels)
CrossTable(test2$y, pred_test2,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  9036 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      7774 |       206 |      7980 | 
                                 |     0.860 |     0.023 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       552 |       504 |      1056 | 
                                 |     0.061 |     0.056 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |      8326 |       710 |      9036 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7257291
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

USING C50 DECISION TREE

PREDICT ON TEST1 DATA

pred_test1<-predict(model_c50,test1)
# Actual labels are passed first, so rows below are actual classes and columns are predictions.
confusionMatrix(as.factor(test1$y), as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  7677  303
       yes  576  480
                                          
               Accuracy : 0.9027          
                 95% CI : (0.8964, 0.9088)
    No Information Rate : 0.9133          
    P-Value [Acc > NIR] : 0.9998          
                                          
                  Kappa : 0.4692          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9302          
            Specificity : 0.6130          
         Pos Pred Value : 0.9620          
         Neg Pred Value : 0.4545          
             Prevalence : 0.9133          
         Detection Rate : 0.8496          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.7716          
                                          
       'Positive' Class : no              
                                          
library(gmodels)
CrossTable(test1$y, pred_test1,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  9036 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      7677 |       303 |      7980 | 
                                 |     0.850 |     0.034 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       576 |       480 |      1056 | 
                                 |     0.064 |     0.053 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |      8253 |       783 |      9036 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7082878
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST2 DATA

pred_test2<-predict(model_c50,test2)
# Actual labels are passed first, so rows below are actual classes and columns are predictions.
confusionMatrix(as.factor(test2$y), as.factor(pred_test2))
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  7685  295
       yes  553  503
                                       
               Accuracy : 0.9062       
                 95% CI : (0.9, 0.9121)
    No Information Rate : 0.9117       
    P-Value [Acc > NIR] : 0.9686       
                                       
                  Kappa : 0.4914       
 Mcnemar's Test P-Value : <2e-16       
                                       
            Sensitivity : 0.9329       
            Specificity : 0.6303       
         Pos Pred Value : 0.9630       
         Neg Pred Value : 0.4763       
             Prevalence : 0.9117       
         Detection Rate : 0.8505       
   Detection Prevalence : 0.8831       
      Balanced Accuracy : 0.7816       
                                       
       'Positive' Class : no           
                                       
library(gmodels)
CrossTable(test2$y, pred_test2,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  9036 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      7685 |       295 |      7980 | 
                                 |     0.850 |     0.033 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       553 |       503 |      1056 | 
                                 |     0.061 |     0.056 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |      8238 |       798 |      9036 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7196792
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

SUMMARY

SUMMARY ON TRAIN, TEST1 AND TEST2 DATA OF 2 SHORTLISTED MODELS (RATES IN %)
MODEL          DATA   ACCURACY  KAPPA  SENSITIVITY  SPECIFICITY  BALANCED_ACCURACY  AREA_UNDER_CURVE  NO_OF_VALUES_PREDICTED_CORRECTLY
RANDOM_FOREST  TRAIN      99.0   97.0         99.0        100.0               99.0              97.7                             26970
RANDOM_FOREST  TEST1      91.0   48.0         93.0         67.0               80.0              70.7                              8219
RANDOM_FOREST  TEST2      91.0   47.0         92.7         66.9               79.8              69.7                              8205
C50            TRAIN      93.0   62.6         94.6         76.9               85.7              78.0                             25235
C50            TEST1      90.5   48.0         93.0         63.0               78.0              71.0                              8179
C50            TEST2      90.0   45.0         92.8         60.0               76.4              70.0                              8134
NOTE: The number of values predicted correctly is out of 27112 in the train data and 9036 in each of the test1 and test2 data.

From the above summary, we can infer that the random forest model overfits more than the C5.0 decision tree model: its train metrics are close to 100%, but apart from sensitivity they drop sharply on test1 and test2 (kappa from 97 to about 48, specificity from 100 to about 67). The C5.0 decision tree shows a much smaller gap between train and test data, so let's use the C5.0 decision tree model and check its performance on some reshuffled data.
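
The comparison in the paragraph above can be made concrete by computing, for each model, how far the train value sits above the average of the two test values. This is a minimal base-R sketch; the data frame `summary_tbl` and the choice of three metrics are assumptions for illustration, with values copied from the summary table.

```r
# Values (in %) copied from the summary table above.
summary_tbl <- data.frame(
  model = rep(c("random_forest", "c50"), each = 3),
  data  = rep(c("train", "test1", "test2"), times = 2),
  accuracy    = c(99.0, 91.0, 91.0, 93.0, 90.5, 90.0),
  kappa       = c(97.0, 48.0, 47.0, 62.6, 48.0, 45.0),
  specificity = c(100.0, 67.0, 66.9, 76.9, 63.0, 60.0)
)

metrics <- c("accuracy", "kappa", "specificity")

# Gap = train value minus the mean of the two test values, per model.
gap <- sapply(split(summary_tbl, summary_tbl$model), function(d) {
  train <- unlist(d[d$data == "train", metrics])
  test  <- colMeans(d[d$data != "train", metrics])
  train - test
})
round(gap, 1)
```

On these figures the random forest's kappa falls by about 49.5 points from train to test, against about 16.1 points for the C5.0 decision tree.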

USING C50 DECISION TREE

SHUFFLE THE DATA

set.seed(123)

RESAMPLING THE DATA

library(caret)
split<-createDataPartition(bmarketing$y,p=0.5,list = FALSE)
train<-bmarketing[split,]
test<-bmarketing[-split,]
split<-createDataPartition(test$y,p=0.5,list=FALSE)
test1<-test[split,]
test2<-test[-split,]
dim(train)
[1] 22592    17
dim(test1)
[1] 11297    17
dim(test2)
[1] 11295    17
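
The split above is stratified on `y`, so each part keeps roughly the same share of "yes" and "no". The base-R sketch below illustrates that idea on a made-up data frame; `toy` and `stratified_idx` are hypothetical names, and this is not how `createDataPartition` is implemented internally.

```r
# Hypothetical stand-in for the marketing data: 12% "yes".
set.seed(123)
toy <- data.frame(id = 1:2000,
                  y = factor(rep(c("no", "yes"), times = c(1760, 240))))

# Sample a fraction p of the row indices within each class of y.
stratified_idx <- function(y, p) {
  idx <- unlist(lapply(split(seq_along(y), y), function(i) {
    i[sample.int(length(i), size = round(p * length(i)))]
  }))
  sort(unname(idx))
}

# 50% train, then the remainder split in half into test1 and test2.
train_idx <- stratified_idx(toy$y, 0.5)
train <- toy[train_idx, ]
rest  <- toy[-train_idx, ]

test1_idx <- stratified_idx(rest$y, 0.5)
test1 <- rest[test1_idx, ]
test2 <- rest[-test1_idx, ]

# Share of "yes" in each part.
yes_share <- sapply(list(train = train, test1 = test1, test2 = test2),
                    function(d) mean(d$y == "yes"))
yes_share
```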

MODEL BUILDING

library(C50)
model_c50<-C5.0(y ~ ., data=train)

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_c50,train)
# Actual labels are passed first, so rows below are actual classes and columns are predictions.
confusionMatrix(as.factor(train$y), as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  19496   455
       yes  1042  1599
                                          
               Accuracy : 0.9337          
                 95% CI : (0.9304, 0.9369)
    No Information Rate : 0.9091          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6448          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9493          
            Specificity : 0.7785          
         Pos Pred Value : 0.9772          
         Neg Pred Value : 0.6055          
             Prevalence : 0.9091          
         Detection Rate : 0.8630          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.8639          
                                          
       'Positive' Class : no              
                                          
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  22592 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     19496 |       455 |     19951 | 
                                 |     0.863 |     0.020 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |      1042 |      1599 |      2641 | 
                                 |     0.046 |     0.071 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     20538 |      2054 |     22592 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7913233
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST1 DATA

pred_test1<-predict(model_c50,test1)
# Actual labels are passed first, so rows below are actual classes and columns are predictions.
confusionMatrix(as.factor(test1$y), as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  9575  401
       yes  722  599
                                          
               Accuracy : 0.9006          
                 95% CI : (0.8949, 0.9061)
    No Information Rate : 0.9115          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.4619          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9299          
            Specificity : 0.5990          
         Pos Pred Value : 0.9598          
         Neg Pred Value : 0.4534          
             Prevalence : 0.9115          
         Detection Rate : 0.8476          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.7644          
                                          
       'Positive' Class : no              
                                          
library(gmodels)
CrossTable(test1$y, pred_test1,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  11297 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      9575 |       401 |      9976 | 
                                 |     0.848 |     0.035 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       722 |       599 |      1321 | 
                                 |     0.064 |     0.053 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     10297 |      1000 |     11297 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7066239
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST2 DATA

pred_test2<-predict(model_c50,test2)
# Actual labels are passed first, so rows below are actual classes and columns are predictions.
confusionMatrix(as.factor(test2$y), as.factor(pred_test2))
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  9593  382
       yes  711  609
                                          
               Accuracy : 0.9032          
                 95% CI : (0.8976, 0.9086)
    No Information Rate : 0.9123          
    P-Value [Acc > NIR] : 0.9996          
                                          
                  Kappa : 0.4744          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9310          
            Specificity : 0.6145          
         Pos Pred Value : 0.9617          
         Neg Pred Value : 0.4614          
             Prevalence : 0.9123          
         Detection Rate : 0.8493          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.7728          
                                          
       'Positive' Class : no              
                                          
library(gmodels)
CrossTable(test2$y, pred_test2,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  11295 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      9593 |       382 |      9975 | 
                                 |     0.849 |     0.034 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       711 |       609 |      1320 | 
                                 |     0.063 |     0.054 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     10304 |       991 |     11295 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7115339
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

RESHUFFLE THE DATA

set.seed(1234)

RESAMPLING THE DATA

library(caret)
split<-createDataPartition(bmarketing$y,p=0.7,list = FALSE)
train<-bmarketing[split,]
test<-bmarketing[-split,]
split<-createDataPartition(test$y,p=0.5,list=FALSE)
test1<-test[split,]
test2<-test[-split,]
dim(train)
[1] 31630    17
dim(test1)
[1] 6777   17
dim(test2)
[1] 6777   17

MODEL BUILDING

library(C50)
model_c50<-C5.0(y ~ ., data=train)

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_c50,train)
# Actual labels are passed first, so rows below are actual classes and columns are predictions.
confusionMatrix(as.factor(train$y), as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  27208   724
       yes  1450  2248
                                         
               Accuracy : 0.9313         
                 95% CI : (0.9284, 0.934)
    No Information Rate : 0.906          
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.6362         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.9494         
            Specificity : 0.7564         
         Pos Pred Value : 0.9741         
         Neg Pred Value : 0.6079         
             Prevalence : 0.9060         
         Detection Rate : 0.8602         
   Detection Prevalence : 0.8831         
      Balanced Accuracy : 0.8529         
                                         
       'Positive' Class : no             
                                         
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  31630 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     27208 |       724 |     27932 | 
                                 |     0.860 |     0.023 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |      1450 |      2248 |      3698 | 
                                 |     0.046 |     0.071 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     28658 |      2972 |     31630 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.790988
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST1 DATA

pred_test1<-predict(model_c50,test1)
# Actual labels are passed first, so rows below are actual classes and columns are predictions.
confusionMatrix(as.factor(test1$y), as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  5729  256
       yes  400  392
                                          
               Accuracy : 0.9032          
                 95% CI : (0.8959, 0.9101)
    No Information Rate : 0.9044          
    P-Value [Acc > NIR] : 0.6391          
                                          
                  Kappa : 0.4909          
 Mcnemar's Test P-Value : 2.361e-08       
                                          
            Sensitivity : 0.9347          
            Specificity : 0.6049          
         Pos Pred Value : 0.9572          
         Neg Pred Value : 0.4949          
             Prevalence : 0.9044          
         Detection Rate : 0.8454          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.7698          
                                          
       'Positive' Class : no              
                                          
library(gmodels)
CrossTable(test1$y, pred_test1,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  6777 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      5729 |       256 |      5985 | 
                                 |     0.845 |     0.038 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       400 |       392 |       792 | 
                                 |     0.059 |     0.058 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |      6129 |       648 |      6777 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7260879
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST2 DATA

pred_test2<-predict(model_c50,test2)
# Actual labels are passed first, so rows below are actual classes and columns are predictions.
confusionMatrix(as.factor(test2$y), as.factor(pred_test2))
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  5732  253
       yes  399  393
                                          
               Accuracy : 0.9038          
                 95% CI : (0.8965, 0.9107)
    No Information Rate : 0.9047          
    P-Value [Acc > NIR] : 0.608           
                                          
                  Kappa : 0.4934          
 Mcnemar's Test P-Value : 1.358e-08       
                                          
            Sensitivity : 0.9349          
            Specificity : 0.6084          
         Pos Pred Value : 0.9577          
         Neg Pred Value : 0.4962          
             Prevalence : 0.9047          
         Detection Rate : 0.8458          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.7716          
                                          
       'Positive' Class : no              
                                          
library(gmodels)
CrossTable(test2$y, pred_test2,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  6777 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      5732 |       253 |      5985 | 
                                 |     0.846 |     0.037 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       399 |       393 |       792 | 
                                 |     0.059 |     0.058 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |      6131 |       646 |      6777 | 
---------------------------------|-----------|-----------|-----------|

 
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7269699
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Let's summarize the performance of the C5.0 decision tree model on the different shuffled splits of the data.

FINAL SUMMARY

FINAL SUMMARY OF C50 DECISION TREE
PARAMETERS         TRAIN  TEST1  TEST2  TRAIN.1  TEST1.1  TEST2.1  TRAIN.2  TEST1.2  TEST2.2
ACCURACY            93.0   90.5   90.0     93.3     90.0     90.0     93.0     90.0     90.4
KAPPA               62.6   48.0   45.0     64.5     46.0     47.0     63.6     49.0     49.0
SENSITIVITY         94.6   93.0   92.8     95.0     93.0     93.0     95.0     93.5     93.5
SPECIFICITY         76.9   63.0   60.0     73.8     60.0     61.0     75.6     60.5     60.8
BALANCED_ACCURACY   85.7   78.0   76.4     86.0     76.0     77.0     85.0     77.0     77.0
AREA_UNDER_CURVE    78.0   71.0   70.0     79.0     70.7     71.0     79.0     72.6     72.7
NOTE: All values are in %. Columns without a suffix are from the original split; the .1 and .2 columns are from the reshuffled splits with seed 123 and seed 1234 respectively.
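
How consistent the test results are across the three splits can be summarized with a mean, standard deviation, and range per metric. The sketch below is a minimal base-R illustration using only the test columns of the final summary table; the object names `test_metrics` and `spread` are assumptions.

```r
# Test-set values (in %) copied from the final summary table above.
test_metrics <- rbind(
  accuracy          = c(90.5, 90.0, 90.0, 90.0, 90.0, 90.4),
  kappa             = c(48.0, 45.0, 46.0, 47.0, 49.0, 49.0),
  sensitivity       = c(93.0, 92.8, 93.0, 93.0, 93.5, 93.5),
  specificity       = c(63.0, 60.0, 60.0, 61.0, 60.5, 60.8),
  balanced_accuracy = c(78.0, 76.4, 76.0, 77.0, 77.0, 77.0),
  auc               = c(71.0, 70.0, 70.7, 71.0, 72.6, 72.7)
)
colnames(test_metrics) <- c("TEST1", "TEST2", "TEST1.1", "TEST2.1",
                            "TEST1.2", "TEST2.2")

# Mean, standard deviation, and max-minus-min per metric across test sets.
spread <- data.frame(
  mean  = rowMeans(test_metrics),
  sd    = apply(test_metrics, 1, sd),
  range = apply(test_metrics, 1, function(x) diff(range(x)))
)
round(spread, 2)
```

On these values the mean test accuracy is 90.15% and no metric varies by more than 4 points across the six test sets.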

CONCLUSION

From the above summary, we can infer that the model performed consistently across the different splits of the marketing dataset, with little sign of overfitting (high variance) or underfitting (high bias): test accuracy stays at about 90% (roughly 10% error) and the other test metrics vary by only a few points between splits. We therefore conclude that the C5.0 decision tree model has good potential to predict the subscription of a bank term deposit on future data, and we select it as the best model, ready to be deployed.

DATASET WITH PREDICTED VALUES


FINAL DATASET WITH OBSERVED AND EXPECTED SUBSCRIPTION BY THE CUSTOMER (displaying only first 15 observations)
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome observed_y expected_y
58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no no
44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no no
33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no no
47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no no
33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no no
35 management married tertiary no 231 yes no unknown 5 may 139 1 -1 0 unknown no no
28 management single tertiary no 447 yes yes unknown 5 may 217 1 -1 0 unknown no no
42 entrepreneur divorced tertiary yes 2 yes no unknown 5 may 380 1 -1 0 unknown no no
58 retired married primary no 121 yes no unknown 5 may 50 1 -1 0 unknown no no
43 technician single secondary no 593 yes no unknown 5 may 55 1 -1 0 unknown no no
41 admin. divorced secondary no 270 yes no unknown 5 may 222 1 -1 0 unknown no no
29 admin. single secondary no 390 yes no unknown 5 may 137 1 -1 0 unknown no no
53 technician married secondary no 6 yes no unknown 5 may 517 1 -1 0 unknown no no
58 technician married unknown no 71 yes no unknown 5 may 71 1 -1 0 unknown no no
57 services married secondary no 162 yes no unknown 5 may 174 1 -1 0 unknown no no
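
The code that produced this table is not shown; one plausible way to build such a dataset is sketched below in base R. The data frame `toy`, the vector `pred`, and the extra `agree` column are all hypothetical, and in the analysis the predictions would come from `predict()` on the chosen model.

```r
# Hypothetical three-row stand-in for the marketing data.
toy <- data.frame(age = c(58, 44, 33),
                  job = c("management", "technician", "entrepreneur"),
                  y   = factor(c("no", "no", "no"), levels = c("no", "yes")))

# Hypothetical predictions with the same factor levels as y.
pred <- factor(c("no", "no", "no"), levels = c("no", "yes"))

# Rename the outcome and attach the predictions alongside it.
final <- toy
names(final)[names(final) == "y"] <- "observed_y"
final$expected_y <- pred

# Optional flag marking rows where prediction and observation agree.
final$agree <- final$observed_y == final$expected_y
final
```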