DEFINE PROBLEM STATEMENT

Will the client subscribe to the bank term deposit offered by a Portuguese banking institution?

DATASET INFORMATION

The data relate to direct marketing campaigns for a bank term deposit run by a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (i.e., the bank term deposit) would be subscribed to or not.

BANK MARKETING (DISPLAYING ONLY 12 OBSERVATIONS)
age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome	y
58	management	married	tertiary	no	2143	yes	no	unknown	5	may	261	1	-1	unknown	no
44	technician	single	secondary	no	29	yes	no	unknown	5	may	151	1	-1	unknown	no
33	entrepreneur	married	secondary	no	2	yes	yes	unknown	5	may	76	1	-1	unknown	no
47	blue-collar	married	unknown	no	1506	yes	no	unknown	5	may	92	1	-1	unknown	no
33	unknown	single	unknown	no	1	no	no	unknown	5	may	198	1	-1	unknown	no
35	management	married	tertiary	no	231	yes	no	unknown	5	may	139	1	-1	unknown	no
28	management	single	tertiary	no	447	yes	yes	unknown	5	may	217	1	-1	unknown	no
42	entrepreneur	divorced	tertiary	yes	2	yes	no	unknown	5	may	380	1	-1	unknown	no
58	retired	married	primary	no	121	yes	no	unknown	5	may	50	1	-1	unknown	no
43	technician	single	secondary	no	593	yes	no	unknown	5	may	55	1	-1	unknown	no
41	admin.	divorced	secondary	no	270	yes	no	unknown	5	may	222	1	-1	unknown	no
29	admin.	single	secondary	no	390	yes	no	unknown	5	may	137	1	-1	unknown	no

ATTRIBUTES INFORMATION

# bank client data:

age : age of the client (numeric)

job : type of job (categorical)(“admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”,“blue-collar”,“self-employed”,“retired”,“technician”,“services”)

marital : marital status (categorical)(“married”,“divorced”,“single”)

education : (categorical)(“unknown”,“secondary”,“primary”,“tertiary”)

default: has credit in default? (binary: “yes”,“no”)

balance: average yearly balance in euros (numeric)

housing: has housing loan? (binary: “yes”,“no”)

loan: has personal loan? (binary: “yes”,“no”)

# related with the last contact of the current campaign:

contact: contact communication type (categorical)(“unknown”,“telephone”,“cellular”)

day: last contact day of the month (numeric)

month: last contact month of year (categorical)(“jan”, “feb”, “mar”, …, “nov”, “dec”)

duration: last contact duration, in seconds (numeric)

# other attributes:

campaign: number of contacts performed during this campaign for this client, including the last contact (numeric)

pdays: number of days since the client was last contacted in a previous campaign (numeric)(-1 means the client was not previously contacted)

previous: number of contacts performed before this campaign and for this client (numeric)

poutcome: outcome of the previous marketing campaign (categorical)(“unknown”,“other”,“failure”,“success”)

y : has the client subscribed to a term deposit? (binary: “yes”,“no”)

The dataset has 45211 observations of 17 variables: y is the binary outcome variable and the remaining 16 variables are the predictors.
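Because -1 in pdays is a marker for "never contacted" rather than a number of days, it can help to separate the two meanings before modelling. The sketch below uses a few made-up rows; the columns contacted_before and pdays_clean are hypothetical names introduced only for illustration and are not part of the original analysis.

```r
# Toy rows only; the column names follow the attribute list above.
toy <- data.frame(
  pdays    = c(-1, 151, -1, 92),
  previous = c(0, 3, 0, 1)
)

# Hypothetical flag: TRUE when the client was reached in an earlier campaign.
toy$contacted_before <- toy$pdays != -1

# Keep the day count only where it is meaningful; NA marks "never contacted".
toy$pdays_clean <- ifelse(toy$contacted_before, toy$pdays, NA)

print(toy)
```

Whether such a recoding improves any of the models below is not tested here.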

summary(bmarketing)

      age                 job           marital          education    
 Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
 1st Qu.:33.00   management :9458   married :27214   secondary:23202  
 Median :39.00   technician :7597   single  :12790   tertiary :13301  
 Mean   :40.94   admin.     :5171                    unknown  : 1857  
 3rd Qu.:48.00   services   :4154                                     
 Max.   :95.00   retired    :2264                                     
                 (Other)    :6835                                     
 default        balance       housing      loan            contact     
 no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
 yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
             Median :   448                           unknown  :13020  
             Mean   :  1362                                            
             3rd Qu.:  1428                                            
             Max.   :102127                                            
                                                                       
      day            month          duration         campaign     
 Min.   : 1.00   may    :13766   Min.   :   0.0   Min.   : 1.000  
 1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
 Median :16.00   aug    : 6247   Median : 180.0   Median : 2.000  
 Mean   :15.81   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
 3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
 Max.   :31.00   apr    : 2932   Max.   :4918.0   Max.   :63.000  
                 (Other): 6060                                    
     pdays          previous           poutcome       y        
 Min.   : -1.0   Min.   :  0.0000   failure: 4901   no :39922  
 1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   yes: 5289  
 Median : -1.0   Median :  0.0000   success: 1511              
 Mean   : 40.2   Mean   :  0.5803   unknown:36959              
 3rd Qu.: -1.0   3rd Qu.:  0.0000                              
 Max.   :871.0   Max.   :275.0000

bmarketing$age<-as.numeric(bmarketing$age)
bmarketing$balance<-as.numeric(bmarketing$balance)
bmarketing$day<-as.numeric(bmarketing$day)
bmarketing$duration<-as.numeric(bmarketing$duration)
bmarketing$campaign<-as.numeric(bmarketing$campaign)
bmarketing$pdays<-as.numeric(bmarketing$pdays)
bmarketing$previous<-as.numeric(bmarketing$previous)
str(bmarketing)

'data.frame':   45211 obs. of  17 variables:
 $ age      : num  58 44 33 47 33 35 28 42 58 43 ...
 $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
 $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
 $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
 $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
 $ balance  : num  2143 29 2 1506 1 ...
 $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
 $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
 $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ day      : num  5 5 5 5 5 5 5 5 5 5 ...
 $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ duration : num  261 151 76 92 198 139 217 380 50 55 ...
 $ campaign : num  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays    : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ previous : num  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

DATA PREPROCESSING

Check For Missing Values

sum(is.na(bmarketing))

[1] 0

There are no missing (NA) values anywhere in the dataset.
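A count of zero NA values does not mean every field is known: the summary above shows the level "unknown" in job, education, contact and poutcome. A minimal sketch of counting that level per column, using a small made-up data frame in place of bmarketing:

```r
# Toy stand-in for bmarketing; values are illustrative only.
toy <- data.frame(
  job       = c("management", "unknown", "technician", "unknown"),
  education = c("tertiary", "unknown", "secondary", "primary"),
  contact   = c("unknown", "unknown", "cellular", "unknown"),
  stringsAsFactors = TRUE
)

# is.na() finds nothing, because "unknown" is an ordinary factor level.
sum(is.na(toy))

# Count the "unknown" level in every column instead.
unknown_counts <- sapply(toy, function(col) sum(col == "unknown"))
print(unknown_counts)
```

Applied to the real data, the same sapply() call would be expected to reproduce the "unknown" counts printed by summary(bmarketing).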

Check For Outliers

boxplot(bmarketing)

The boxplot shows outliers in several variables. The variables 'balance', 'duration' and 'previous' in particular have a few extreme outliers, which are removed because they can affect model performance.

Remove extreme outliers

dim(bmarketing)

[1] 45211    17

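# Drop the extreme values seen in the boxplot
bmarketing<-bmarketing[bmarketing$balance<=5e+04,]   # very large balances
bmarketing<-bmarketing[bmarketing$balance!=-8019,]   # two extreme negative balances
bmarketing<-bmarketing[bmarketing$balance!=-6847,]
bmarketing<-bmarketing[bmarketing$duration<=3500,]   # calls longer than 3500 seconds
bmarketing<-bmarketing[bmarketing$previous<=50,]     # more than 50 previous contacts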
boxplot(bmarketing)

dim(bmarketing)

[1] 45184    17

After removing these extreme outliers, 45184 observations remain (27 rows dropped). These observations are split further into train, test1 and test2 data.
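The thresholds above are fixed values read off the boxplot. As an alternative sketch, not the method used in this report, extreme values can be flagged with Tukey's outer fences (3 times the interquartile range beyond the quartiles). The vector below is made-up data with two planted extremes that mimic the largest and smallest balances in the summary.

```r
# Toy balances: a regular grid plus two planted extreme values.
balance <- c(seq(0, 2000, by = 10), 102127, -8019)

q   <- unname(quantile(balance, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]
k   <- 3                      # 3 * IQR marks "extreme" outliers; 1.5 marks mild ones

lower <- q[1] - k * iqr
upper <- q[2] + k * iqr

is_extreme   <- balance < lower | balance > upper
balance_kept <- balance[!is_extreme]

cat("flagged:", sum(is_extreme), " kept:", length(balance_kept), "\n")
```

A rule like this adapts to each variable's spread, but on heavily skewed variables such as duration it may flag far more rows than the hand-picked cut-offs do.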

RESAMPLING THE PRE-PROCESSED DATA

library(caret)

Loading required package: lattice

Loading required package: ggplot2

split<-createDataPartition(bmarketing$y,p=0.6,list = FALSE)
train<-bmarketing[split,]
test<-bmarketing[-split,]
split<-createDataPartition(test$y,p=0.5,list=FALSE)
test1<-test[split,]
test2<-test[-split,]
dim(train)

[1] 27112    17

dim(test1)

[1] 9036   17

dim(test2)

[1] 9036   17

Here the 45184 observations are split into 27112 observations of train data and 9036 observations each of test1 and test2 data.
Now let us build the models on the train data, choose the best ones, and test them further on the test1 and test2 data.
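No random seed is set before the split, so the partitions (and every figure that follows) change from run to run. The sketch below shows a reproducible 60/20/20 split that keeps the class ratio in each part, written in base R on a made-up outcome vector. The helper stratified_idx and the seed value are assumptions for illustration; createDataPartition also stratifies on a factor outcome, so this is only meant to make the idea explicit.

```r
set.seed(42)  # assumed seed; any fixed value makes the split repeatable

# Toy outcome with roughly the same imbalance as y (12% "yes").
y <- factor(rep(c("no", "yes"), times = c(880, 120)))

# Hypothetical helper: sample a proportion p of the positions within each class.
stratified_idx <- function(y, p) {
  per_class <- lapply(split(seq_along(y), y), function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  unlist(per_class, use.names = FALSE)
}

train_idx <- stratified_idx(y, 0.6)                 # 60% for training
rest_idx  <- setdiff(seq_along(y), train_idx)       # remaining 40%
test1_idx <- rest_idx[stratified_idx(y[rest_idx], 0.5)]
test2_idx <- setdiff(rest_idx, test1_idx)

c(train = length(train_idx), test1 = length(test1_idx), test2 = length(test2_idx))
prop.table(table(y[train_idx]))
```

With caret, calling set.seed() once before the first createDataPartition() should be enough to make the original split repeatable as well.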

BINARY LOGISTIC REGRESSION

MODEL BUILDING

library(caret)
kfoldrepeated<-trainControl(method = "repeatedcv",number = 5,repeats = 3)
model_bin<-train(y ~ .,data=train,method="glm",trControl=kfoldrepeated)
model_bin

Generalized Linear Model 

27112 samples
   16 predictor
    2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 21689, 21690, 21689, 21690, 21690, 21689, ... 
Resampling results:

  Accuracy   Kappa    
  0.8991959  0.3816199

Here the model is built using 5-fold cross-validation repeated 3 times.
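To make the resampling scheme concrete, the sketch below assigns 27112 rows to 5 folds, three times over, and prints how many rows remain for fitting when each fold is held out. It is an unstratified simplification in base R; caret builds its folds internally and balances them on the outcome, so this illustrates the idea rather than caret's exact procedure.

```r
set.seed(7)  # assumed seed for illustration

n       <- 27112   # rows in the train data
k       <- 5       # folds
repeats <- 3       # independent repetitions

# One random fold assignment (values 1..k) per repetition.
folds <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# Rows left for fitting when each fold of the first repetition is held out.
train_sizes <- n - as.integer(table(folds[[1]]))
print(train_sizes)

# Total number of model fits across all repetitions.
k * repeats
```

Each training subset has 21689 or 21690 rows, which matches the "Summary of sample sizes" line in the output above, and the reported accuracy and kappa are averages over the 15 held-out folds.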

PREDICTING ON TRAIN DATA

pred_train<-predict(model_bin,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  23351   591
       yes  2134  1036
                                         
               Accuracy : 0.8995         
                 95% CI : (0.8959, 0.903)
    No Information Rate : 0.94           
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.383          
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.9163         
            Specificity : 0.6368         
         Pos Pred Value : 0.9753         
         Neg Pred Value : 0.3268         
             Prevalence : 0.9400         
         Detection Rate : 0.8613         
   Detection Prevalence : 0.8831         
      Balanced Accuracy : 0.7765         
                                         
       'Positive' Class : no

library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  27112 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     23351 |       591 |     23942 | 
                                 |     0.861 |     0.022 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |      2134 |      1036 |      3170 | 
                                 |     0.079 |     0.038 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     25485 |      1627 |     27112 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)

Loading required package: gplots


Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.6510646

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

The model reaches 90% accuracy on the train data (86% of all rows are correctly predicted non-subscribers and 4% are correctly predicted subscribers), but Cohen's kappa is low (0.38), only 1036 of the 3170 actual subscribers are identified, and the area under the curve is 65%.
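Two details of the calls above affect how the printed statistics should be read. caret's confusionMatrix() takes the predictions as its first argument (data) and the true labels as its second (reference); the call above passes them the other way round, so the per-class statistics swap roles (the printed "Sensitivity" is really the precision for "no"), while accuracy and kappa are unaffected. Also, ROCR is given hard class labels rather than probabilities, so the ROC curve has a single threshold and its area equals the mean of the two class recalls. The sketch below recomputes the measures from the cross table counts with the actual class as the reference.

```r
# Counts from the logistic regression cross table (rows = actual, columns = predicted).
tab <- matrix(c(23351,  591,
                 2134, 1036),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("no", "yes"), predicted = c("no", "yes")))

accuracy      <- sum(diag(tab)) / sum(tab)
recall_no     <- tab["no", "no"]   / sum(tab["no", ])    # share of non-subscribers found
recall_yes    <- tab["yes", "yes"] / sum(tab["yes", ])   # share of subscribers found
precision_yes <- tab["yes", "yes"] / sum(tab[, "yes"])   # share of "yes" predictions that are right

# With hard labels there is one ROC point, so AUC = mean of the two recalls.
balanced_accuracy <- (recall_no + recall_yes) / 2

round(c(accuracy = accuracy, recall_no = recall_no, recall_yes = recall_yes,
        precision_yes = precision_yes, balanced_accuracy = balanced_accuracy), 4)
```

The last value reproduces the AUC of 0.651 printed above. Passing class probabilities, for example from predict(model_bin, train, type = "prob"), to prediction() would give a threshold-free AUC instead.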

NAIVE BAYES CLASSIFIER

MODEL BUILDING

library(naivebayes)
model_nb<-naive_bayes(y ~ ., data=train)
model_nb$prior


       no       yes 
0.8830776 0.1169224

The model first estimates the prior probabilities from the train data (88% not subscribing and 12% subscribing). As naive Bayes is a probabilistic model, it combines these priors with the class-conditional likelihoods of the predictors to compute posterior probabilities.
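The priors are simply the class shares in the train data, and Bayes' rule turns them into posteriors once a likelihood is known. The sketch below reproduces the priors from the class counts and then applies the rule to a single made-up likelihood pair; the likelihood values are assumptions chosen only to show the arithmetic.

```r
# Class counts in the train data (row totals of the cross tables).
n_no  <- 23942
n_yes <- 3170

prior <- c(no = n_no, yes = n_yes) / (n_no + n_yes)
print(round(prior, 7))

# Assumed likelihoods of one observed feature value under each class
# (illustrative numbers, not estimated from the data).
likelihood <- c(no = 0.10, yes = 0.40)

# Bayes' rule: posterior is proportional to prior * likelihood.
unnormalised <- prior * likelihood
posterior    <- unnormalised / sum(unnormalised)
print(round(posterior, 4))
```

With several predictors, naive Bayes multiplies one such likelihood per predictor, treating the predictors as independent given the class.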

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_nb,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  22007  1935
       yes  1525  1645
                                          
               Accuracy : 0.8724          
                 95% CI : (0.8684, 0.8763)
    No Information Rate : 0.868           
    P-Value [Acc > NIR] : 0.01571         
                                          
                  Kappa : 0.4148          
 Mcnemar's Test P-Value : 3.571e-12       
                                          
            Sensitivity : 0.9352          
            Specificity : 0.4595          
         Pos Pred Value : 0.9192          
         Neg Pred Value : 0.5189          
             Prevalence : 0.8680          
         Detection Rate : 0.8117          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.6973          
                                          
       'Positive' Class : no

library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  27112 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     22007 |      1935 |     23942 | 
                                 |     0.812 |     0.071 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |      1525 |      1645 |      3170 | 
                                 |     0.056 |     0.061 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     23532 |      3580 |     27112 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.7190536

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

The model has 87% accuracy, but Cohen's kappa (0.41) and specificity (0.46) are low.
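Kappa compares the observed agreement with the agreement expected by chance from the row and column totals, which is why it stays low here even though accuracy is high: a model that always predicts "no" already agrees with 88% of the rows. A short sketch recomputing it from the naive Bayes cross table counts:

```r
# Counts from the naive Bayes cross table (rows = actual, columns = predicted).
tab <- matrix(c(22007, 1935,
                 1525, 1645),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("no", "yes"), predicted = c("no", "yes")))

n <- sum(tab)

# Observed agreement: share of rows on the diagonal.
p_observed <- sum(diag(tab)) / n

# Chance agreement: product of matching row and column shares, summed over classes.
p_expected <- sum(rowSums(tab) * colSums(tab)) / n^2

kappa <- (p_observed - p_expected) / (1 - p_expected)
round(c(p_observed = p_observed, p_expected = p_expected, kappa = kappa), 4)
```

The result agrees with the Kappa value in the confusionMatrix output above.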

CART DECISION TREE

MODEL BUILDING

library(rpart)
model_cart<-rpart(y ~ ., data=train,control = rpart.control(minbucket = 10))
model_cart$cptable

          CP nsplit rel error    xerror       xstd
1 0.03417455      0 1.0000000 1.0000000 0.01669052
2 0.02839117      3 0.8974763 0.9176656 0.01607557
3 0.02050473      4 0.8690852 0.8772871 0.01575943
4 0.01000000      5 0.8485804 0.8577287 0.01560261

library(rpart.plot)
rpart.plot(model_cart)

rpart fits the tree over a sequence of complexity parameter (cp) values and reports the cross-validated error (xerror) for each one. The lowest xerror (0.858) is obtained at cp = 0.01 with 5 splits, so that tree is kept.
In the plotted tree, the variable duration is selected as the root node because it gives the lowest Gini index.
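The choice of cp can be read off the cptable programmatically. The sketch below rebuilds the table printed above as a plain matrix and picks the row with the smallest cross-validated error; it also applies the common one-standard-error rule as a cross-check. In a live session the matrix would come from model_cart$cptable, and the selected value could then be passed to rpart's prune().

```r
# The cptable printed above, re-entered by hand.
cptable <- matrix(c(
  0.03417455, 0, 1.0000000, 1.0000000, 0.01669052,
  0.02839117, 3, 0.8974763, 0.9176656, 0.01607557,
  0.02050473, 4, 0.8690852, 0.8772871, 0.01575943,
  0.01000000, 5, 0.8485804, 0.8577287, 0.01560261),
  ncol = 5, byrow = TRUE,
  dimnames = list(NULL, c("CP", "nsplit", "rel error", "xerror", "xstd")))

# Row with the lowest cross-validated error.
best    <- which.min(cptable[, "xerror"])
best_cp <- cptable[best, "CP"]

# One-standard-error rule: the simplest tree within one xstd of the minimum.
threshold <- cptable[best, "xerror"] + cptable[best, "xstd"]
one_se    <- min(which(cptable[, "xerror"] <= threshold))

cat("min-xerror cp:", best_cp, " 1-SE cp:", cptable[one_se, "CP"], "\n")
```

Both rules select the same row here, so no further pruning is indicated by this table.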

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_cart,train,type="class")
confusionMatrix(as.factor(train$y),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  23327   615
       yes  2075  1095
                                          
               Accuracy : 0.9008          
                 95% CI : (0.8972, 0.9043)
    No Information Rate : 0.9369          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3996          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9183          
            Specificity : 0.6404          
         Pos Pred Value : 0.9743          
         Neg Pred Value : 0.3454          
             Prevalence : 0.9369          
         Detection Rate : 0.8604          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.7793          
                                          
       'Positive' Class : no

library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  27112 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     23327 |       615 |     23942 | 
                                 |     0.860 |     0.023 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |      2075 |      1095 |      3170 | 
                                 |     0.077 |     0.040 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     25402 |      1710 |     27112 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.6598694

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

The CART decision tree performs well with 90% accuracy, but kappa is low (0.40), specificity is moderate (0.64), and the area under the curve is 66%.

C50 DECISION TREE

MODEL BUILDING

library(C50)
model_c50<-C5.0(y ~ ., data=train)

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_c50,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  23405   537
       yes  1311  1859
                                          
               Accuracy : 0.9318          
                 95% CI : (0.9288, 0.9348)
    No Information Rate : 0.9116          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6308          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9470          
            Specificity : 0.7759          
         Pos Pred Value : 0.9776          
         Neg Pred Value : 0.5864          
             Prevalence : 0.9116          
         Detection Rate : 0.8633          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.8614          
                                          
       'Positive' Class : no

library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  27112 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     23405 |       537 |     23942 | 
                                 |     0.863 |     0.020 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |      1311 |      1859 |      3170 | 
                                 |     0.048 |     0.069 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     24716 |      2396 |     27112 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.7820031

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

The model is 93% accurate, kappa (0.63) is clearly higher than for the previous models, and the other measures are also good, so it is a good model.

RANDOM FOREST

MODEL BUILDING

library(randomForest)

randomForest 4.6-14

Type rfNews() to see new features/changes/bug fixes.


Attaching package: 'randomForest'

The following object is masked from 'package:ggplot2':

    margin

model_rf<-randomForest(y ~ ., data=train,mtry=3)
model_rf


Call:
 randomForest(formula = y ~ ., data = train, mtry = 3) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 9.56%
Confusion matrix:
       no  yes class.error
no  23186  756  0.03157631
yes  1835 1335  0.57886435

plot(model_rf)

varImpPlot(model_rf)

The random forest has an out-of-bag (OOB) error of 9.56%. Each tree is grown on a bootstrap sample of the train data, and the roughly 37% of rows left out of that sample serve as test data for that tree.
From the model plot we can see that the error rate decreases as the number of trees increases, and from the variable importance plot we can infer that variables with a larger mean decrease in Gini index are more important.
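The share of rows that end up out of bag follows from sampling with replacement: a given row is missed by all n draws with probability (1 - 1/n)^n, which approaches exp(-1), about 0.368. A quick check in base R, both analytically and by simulating one bootstrap sample of the train data size (the seed is an arbitrary choice):

```r
n <- 27112  # rows in the train data

# Probability that a given row is never drawn in n draws with replacement.
p_oob_exact <- (1 - 1 / n)^n

# Simulate one bootstrap sample and measure the share of rows left out.
set.seed(3)  # assumed seed for illustration
in_bag    <- unique(sample.int(n, size = n, replace = TRUE))
p_oob_sim <- 1 - length(in_bag) / n

round(c(exact = p_oob_exact, limit = exp(-1), simulated = p_oob_sim), 4)
```

Because every row is out of bag for about a third of the 500 trees, the OOB error behaves like a built-in validation estimate and does not require a separate hold-out set.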

PREDICT ON TRAIN DATA

pred_train<-predict(model_rf,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  23942     0
       yes   128  3042
                                          
               Accuracy : 0.9953          
                 95% CI : (0.9944, 0.9961)
    No Information Rate : 0.8878          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9767          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9947          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9596          
             Prevalence : 0.8878          
         Detection Rate : 0.8831          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.9973          
                                          
       'Positive' Class : no

library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  27112 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     23942 |         0 |     23942 | 
                                 |     0.883 |     0.000 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       128 |      3042 |      3170 | 
                                 |     0.005 |     0.112 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     24070 |      3042 |     27112 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.9798107

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

The model performs excellently on the train data, with 99.5% accuracy and all other measures close to their maximum. However, the OOB error of 9.56% suggests these train figures are optimistic, so we need to check whether the performance holds on data the model has not seen (test1 and test2).
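The gap between the two estimates can be quantified directly from the two confusion matrices printed above: the one from predicting on the train data and the OOB one from the model summary. A minimal sketch:

```r
# Rows = actual, columns = predicted, in the order no / yes.
train_tab <- matrix(c(23942,    0,
                        128, 3042), nrow = 2, byrow = TRUE)
oob_tab   <- matrix(c(23186,  756,
                       1835, 1335), nrow = 2, byrow = TRUE)

accuracy <- function(tab) sum(diag(tab)) / sum(tab)

train_acc <- accuracy(train_tab)
oob_acc   <- accuracy(oob_tab)

round(c(train = train_acc, oob = oob_acc, gap = train_acc - oob_acc), 4)

# Share of actual subscribers found, on train predictions and out of bag.
round(c(train = 3042 / (128 + 3042), oob = 1335 / (1835 + 1335)), 4)
```

The OOB accuracy of about 90% is the more realistic figure, and the drop in the share of subscribers found (from 96% to 42%) shows that the fitted trees have largely memorised the "yes" rows of the train data.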

ADAPTIVE BOOSTING

MODEL BUILDING

library(ada)
model_ada<-ada(y ~ ., data=train,loss='exponential',type='discrete',iter=100)
model_ada

Call:
ada(y ~ ., data = train, loss = "exponential", type = "discrete", 
    iter = 100)

Loss: exponential Method: discrete   Iteration: 100 

Final Confusion Matrix for Data:
          Final Prediction
True value    no   yes
       no  23281   661
       yes  1792  1378

Train Error: 0.09 

Out-Of-Bag Error:  0.091  iteration= 88 

Additional Estimates of number of iterations:

train.err1 train.kap1 
        95         95

plot(model_ada)

Here discrete AdaBoost with exponential loss is used to build the model with 100 iterations. The train error is 9% and the out-of-bag error is 9.1%, and the plot shows the error rate decreasing as the number of iterations increases.
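The core of discrete AdaBoost is the reweighting step: after each weak learner, misclassified rows gain weight so that the next learner concentrates on them. The sketch below runs one such round on five made-up labels using the textbook formulation with labels coded as +1 and -1. It illustrates the idea only and makes no claim about the internal details of the ada package.

```r
# Toy data: true labels and one weak learner's predictions, coded +1 / -1.
y <- c( 1,  1, -1, -1,  1)
h <- c( 1, -1, -1, -1,  1)   # the second row is misclassified

# Start from uniform weights.
w <- rep(1 / length(y), length(y))

# Weighted error of the weak learner and its vote (alpha).
err   <- sum(w[h != y])
alpha <- 0.5 * log((1 - err) / err)

# Exponential-loss update: up-weight mistakes, down-weight correct rows, renormalise.
w_new <- w * exp(-alpha * y * h)
w_new <- w_new / sum(w_new)

round(c(err = err, alpha = alpha), 4)
round(w_new, 4)
```

After this round the single misclassified row carries half of the total weight. The final classifier is the sign of the alpha-weighted sum of all weak learners' predictions.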

PREDICT ON TRAIN DATA

pred_train<-predict(model_ada,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  23281   661
       yes  1792  1378
                                         
               Accuracy : 0.9095         
                 95% CI : (0.906, 0.9129)
    No Information Rate : 0.9248         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.4816         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.9285         
            Specificity : 0.6758         
         Pos Pred Value : 0.9724         
         Neg Pred Value : 0.4347         
             Prevalence : 0.9248         
         Detection Rate : 0.8587         
   Detection Prevalence : 0.8831         
      Balanced Accuracy : 0.8022         
                                         
       'Positive' Class : no

library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  27112 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     23281 |       661 |     23942 | 
                                 |     0.859 |     0.024 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |      1792 |      1378 |      3170 | 
                                 |     0.066 |     0.051 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     25073 |      2039 |     27112 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.703546

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

The model performs well with 91% accuracy, but kappa is low at 0.48; the remaining measures are satisfactory.

Now let us summarize the performance of all the models built above on the train data.

SUMMARY

SUMMARY OF THE MODELS ON TRAIN DATA (ALL MEASURES IN %, AS PRINTED IN THE OUTPUTS ABOVE)
MODEL	ACCURACY	KAPPA	SENSITIVITY	SPECIFICITY	BALANCED_ACCURACY	AREA_UNDER_CURVE	NO_OF_VALUES_PREDICTED_CORRECTLY (out of 27112)
BINARY_LOGISTIC_REGRESSION	89.95	38.30	91.63	63.68	77.65	65.11	24387
NAIVE_BAYES	87.24	41.48	93.52	45.95	69.73	71.91	23652
CART_DECISION_TREE	90.08	39.96	91.83	64.04	77.93	65.99	24422
C50_DECISION_TREE	93.18	63.08	94.70	77.59	86.14	78.20	25264
RANDOM_FOREST	99.53	97.67	99.47	100.00	99.73	97.98	26984
ADAPTIVE_BOOSTING	90.95	48.16	92.85	67.58	80.22	70.35	24659

From the summary we can infer that the random forest performed best of all the models, followed by the C5.0 decision tree; the other models performed satisfactorily but with lower kappa. So let us shortlist the random forest and C5.0 decision tree models and check whether their performance is consistent on the test1 and test2 data, that is, whether the train results generalize or reflect overfitting.
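The ranking can be recomputed from the confusion counts alone, which avoids copying figures by hand. The sketch below enters the four cells of each train cross table shown earlier (named actual_predicted) and derives the number of correct predictions, the accuracy, and the share of actual subscribers found. The short model labels are names chosen for this example.

```r
# Cross table cells from the train predictions above; column names read actual_predicted.
counts <- data.frame(
  model   = c("logistic", "naive_bayes", "cart", "c50", "random_forest", "adaboost"),
  no_no   = c(23351, 22007, 23327, 23405, 23942, 23281),
  no_yes  = c(  591,  1935,   615,   537,     0,   661),
  yes_no  = c( 2134,  1525,  2075,  1311,   128,  1792),
  yes_yes = c( 1036,  1645,  1095,  1859,  3042,  1378)
)

n <- with(counts, no_no + no_yes + yes_no + yes_yes)

counts$correct    <- counts$no_no + counts$yes_yes
counts$accuracy   <- round(100 * counts$correct / n, 1)
counts$recall_yes <- round(100 * counts$yes_yes / (counts$yes_no + counts$yes_yes), 1)

# Models ordered from most to least accurate on the train data.
print(counts[order(-counts$accuracy), c("model", "correct", "accuracy", "recall_yes")])
```

The recall_yes column is included because, with only 12% subscribers, accuracy alone says little about how many subscribers a model actually identifies.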

USING RANDOM FOREST

PREDICT ON TEST1 DATA

pred_test1<-predict(model_rf,test1)
confusionMatrix(as.factor(test1$y),as.factor(pred_test1))

Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  7746  234
       yes  617  439
                                          
               Accuracy : 0.9058          
                 95% CI : (0.8996, 0.9118)
    No Information Rate : 0.9255          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.4585          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9262          
            Specificity : 0.6523          
         Pos Pred Value : 0.9707          
         Neg Pred Value : 0.4157          
             Prevalence : 0.9255          
         Detection Rate : 0.8572          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.7893          
                                          
       'Positive' Class : no

library(gmodels)
CrossTable(test1$y, pred_test1,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  9036 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      7746 |       234 |      7980 | 
                                 |     0.857 |     0.026 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       617 |       439 |      1056 | 
                                 |     0.068 |     0.049 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |      8363 |       673 |      9036 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.6931982

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
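
Because `prediction()` above receives hard class labels rather than class probabilities, the ROC curve has a single interior point, and the area under it reduces to the average of the two per-class recall values. A small sketch of that arithmetic, using the counts from the test1 cross table (the function name `auc_from_labels` is an assumption made for illustration):

```r
# Area under a ROC curve with one interior point (fpr, tpr),
# obtained by joining (0,0) -> (fpr,tpr) -> (1,1) with straight lines.
auc_from_labels <- function(tn, fp, fn, tp) {
  tpr <- tp / (tp + fn)   # share of actual "yes" predicted "yes"
  fpr <- fp / (fp + tn)   # share of actual "no" predicted "yes"
  # sum of the two trapezoid areas under the segments
  (fpr * tpr) / 2 + (1 - fpr) * (tpr + 1) / 2
}

# Counts from the test1 cross table:
# actual no -> 7746 (no) / 234 (yes), actual yes -> 617 (no) / 439 (yes)
auc_test1 <- auc_from_labels(tn = 7746, fp = 234, fn = 617, tp = 439)
print(auc_test1)
```

The result agrees with the value reported by ROCR above, which suggests the reported area is equivalent to the mean of the two recall values rather than a threshold-free ranking measure.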

PREDICT ON TEST2 DATA

pred_test2<-predict(model_rf,test2)
confusionMatrix(as.factor(test2$y),as.factor(pred_test2))

Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  7774  206
       yes  552  504
                                          
               Accuracy : 0.9161          
                 95% CI : (0.9102, 0.9217)
    No Information Rate : 0.9214          
    P-Value [Acc > NIR] : 0.9701          
                                          
                  Kappa : 0.5263          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9337          
            Specificity : 0.7099          
         Pos Pred Value : 0.9742          
         Neg Pred Value : 0.4773          
             Prevalence : 0.9214          
         Detection Rate : 0.8603          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.8218          
                                          
       'Positive' Class : no
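
The `Kappa` value in the output above compares the observed agreement with the agreement expected by chance from the row and column totals. As a sketch in base R (the helper `cohen_kappa` is hypothetical; the counts are those of the test2 table above):

```r
# Cohen's kappa for a square table of counts.
cohen_kappa <- function(m) {
  n  <- sum(m)
  po <- sum(diag(m)) / n                     # observed agreement
  pe <- sum(rowSums(m) * colSums(m)) / n^2   # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Counts copied from the test2 table above.
test2_counts <- matrix(c(7774, 206,
                         552, 504),
                       nrow = 2, byrow = TRUE)
kappa_test2 <- cohen_kappa(test2_counts)
print(round(kappa_test2, 4))
```

Kappa is symmetric in rows and columns, so it does not depend on which argument is treated as the reference.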

library(gmodels)
CrossTable(test2$y, pred_test2,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  9036 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      7774 |       206 |      7980 | 
                                 |     0.860 |     0.023 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       552 |       504 |      1056 | 
                                 |     0.061 |     0.056 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |      8326 |       710 |      9036 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.7257291

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

USING C50 DECISION TREE

PREDICT ON TEST1 DATA

pred_test1<-predict(model_c50,test1)
confusionMatrix(as.factor(test1$y),as.factor(pred_test1))

Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  7677  303
       yes  576  480
                                          
               Accuracy : 0.9027          
                 95% CI : (0.8964, 0.9088)
    No Information Rate : 0.9133          
    P-Value [Acc > NIR] : 0.9998          
                                          
                  Kappa : 0.4692          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9302          
            Specificity : 0.6130          
         Pos Pred Value : 0.9620          
         Neg Pred Value : 0.4545          
             Prevalence : 0.9133          
         Detection Rate : 0.8496          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.7716          
                                          
       'Positive' Class : no

library(gmodels)
CrossTable(test1$y, pred_test1,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  9036 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      7677 |       303 |      7980 | 
                                 |     0.850 |     0.034 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       576 |       480 |      1056 | 
                                 |     0.064 |     0.053 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |      8253 |       783 |      9036 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.7082878

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST2 DATA

pred_test2<-predict(model_c50,test2)
confusionMatrix(as.factor(test2$y),as.factor(pred_test2))

Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  7685  295
       yes  553  503
                                       
               Accuracy : 0.9062       
                 95% CI : (0.9, 0.9121)
    No Information Rate : 0.9117       
    P-Value [Acc > NIR] : 0.9686       
                                       
                  Kappa : 0.4914       
 Mcnemar's Test P-Value : <2e-16       
                                       
            Sensitivity : 0.9329       
            Specificity : 0.6303       
         Pos Pred Value : 0.9630       
         Neg Pred Value : 0.4763       
             Prevalence : 0.9117       
         Detection Rate : 0.8505       
   Detection Prevalence : 0.8831       
      Balanced Accuracy : 0.7816       
                                       
       'Positive' Class : no

library(gmodels)
CrossTable(test2$y, pred_test2,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  9036 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      7685 |       295 |      7980 | 
                                 |     0.850 |     0.033 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       553 |       503 |      1056 | 
                                 |     0.061 |     0.056 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |      8238 |       798 |      9036 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.7196792

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
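
The ROC curves in this section are built from predicted class labels. If class probabilities were used instead (for a C5.0 model these could come from `predict(model_c50, test2, type = "prob")`, which is an assumption about how the scores would be obtained and is not run here), the area under the curve can be computed from ranks. A self-contained sketch with made-up scores, where `auc_from_scores` is a hypothetical helper:

```r
# Rank-based (Mann-Whitney) AUC: the probability that a random positive
# receives a higher score than a random negative, ties counted as 1/2.
auc_from_scores <- function(scores, is_positive) {
  r     <- rank(scores)            # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores for illustration only (not taken from the bank data).
scores      <- c(0.9, 0.8, 0.65, 0.6, 0.2, 0.1)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
toy_auc <- auc_from_scores(scores, is_positive)
print(toy_auc)
```

With probabilities as scores the curve has many points, so the area reflects how well the model ranks subscribers above non-subscribers at every threshold.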

SUMMARY

SUMMARY ON TRAIN, TEST1 AND TEST2 DATA OF THE 2 SHORTLISTED MODELS
MODEL	ACCURACY	KAPPA	SENSITIVITY	SPECIFICITY	BALANCED_ACCURACY	AREA_UNDER_CURVE	NO_OF_VALUES_PREDICTED_CORRECTLY
RANDOM_FOREST_TRAIN	99.0	97.0	99.0	100.0	99.0	97.7	26970
RANDOM_FOREST_TEST1	91.0	48.0	93.0	67.0	80.0	70.7	8219
RANDOM_FOREST_TEST2	91.0	47.0	92.7	66.9	79.8	69.7	8205
C50_TRAIN	93.0	62.6	94.6	76.9	85.7	78.0	25235
C50_TEST1	90.5	48.0	93.0	63.0	78.0	71.0	8179
C50_TEST2	90.0	45.0	92.8	60.0	76.4	70.0	8134

NOTE: The number of values predicted correctly is out of 27112 in the train data and out of 9036 in each of the test1 and test2 data.
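
The counts in the last column of the summary can be converted back to accuracies by dividing by the data set sizes given in the note. A short sketch of that arithmetic (the vector names are illustrative; the numbers are copied from the summary table and the note):

```r
# Correctly predicted counts from the summary table.
correct <- c(RF_TRAIN = 26970, RF_TEST1 = 8219, RF_TEST2 = 8205,
             C50_TRAIN = 25235, C50_TEST1 = 8179, C50_TEST2 = 8134)
# Data set sizes from the note: 27112 for train, 9036 for each test set.
total <- c(27112, 9036, 9036, 27112, 9036, 9036)

accuracy_pct <- round(100 * correct / total, 1)
print(accuracy_pct)
```

The recomputed percentages should agree with the ACCURACY column to within about half a percentage point, the table values being rounded more coarsely.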

From the summary above, we can infer that the random forest model overfits more than the C50 decision tree model: apart from sensitivity, its metrics drop sharply from the train data to the test data, whereas the C50 decision tree performs more consistently across all three data sets. So let's use the C50 decision tree model and check its performance on reshuffled data.
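
The comparison can be made concrete by computing, for each model, the gap between the train value and the mean of the two test values. A minimal sketch using the accuracy and kappa figures from the summary table (the data frame and column names are illustrative):

```r
# Metrics (in percent) copied from the summary table.
summary_tbl <- data.frame(
  model       = c("RANDOM_FOREST", "C50"),
  train_acc   = c(99.0, 93.0),
  test1_acc   = c(91.0, 90.5),
  test2_acc   = c(91.0, 90.0),
  train_kappa = c(97.0, 62.6),
  test1_kappa = c(48.0, 48.0),
  test2_kappa = c(47.0, 45.0)
)

# Gap between train performance and the mean of the two test sets.
summary_tbl$acc_gap <- summary_tbl$train_acc -
  (summary_tbl$test1_acc + summary_tbl$test2_acc) / 2
summary_tbl$kappa_gap <- summary_tbl$train_kappa -
  (summary_tbl$test1_kappa + summary_tbl$test2_kappa) / 2

print(summary_tbl[, c("model", "acc_gap", "kappa_gap")])
```

A larger gap indicates a model that fits the train data much better than unseen data; on these figures the gap is clearly larger for the random forest on both metrics.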

USING C50 DECISION TREE

SHUFFLE THE DATA

set.seed(123)

RESAMPLING THE DATA

library(caret)
split<-createDataPartition(bmarketing$y,p=0.5,list = FALSE)
train<-bmarketing[split,]
test<-bmarketing[-split,]
split<-createDataPartition(test$y,p=0.5,list=FALSE)
test1<-test[split,]
test2<-test[-split,]
dim(train)

[1] 22592    17

dim(test1)

[1] 11297    17

dim(test2)

[1] 11295    17
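
`createDataPartition()` samples within each level of `y`, so both classes keep roughly the same proportion in every subset. The following base-R sketch imitates that idea on synthetic labels; `stratified_index` is a hypothetical helper, not the caret implementation, and the per-class `ceiling()` rounding is an assumption.

```r
# Hypothetical helper: draw a fraction p of the positions within each class.
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)
# Synthetic, imbalanced labels for illustration only.
y <- factor(rep(c("no", "yes"), times = c(880, 120)))

train_idx <- stratified_index(y, p = 0.5)
rest_idx  <- setdiff(seq_along(y), train_idx)
# Positions returned for y[rest_idx] are mapped back to the original rows.
test1_idx <- rest_idx[stratified_index(y[rest_idx], p = 0.5)]
test2_idx <- setdiff(rest_idx, test1_idx)

print(table(y[train_idx]))
print(c(train = length(train_idx),
        test1 = length(test1_idx),
        test2 = length(test2_idx)))
```

The same two-step pattern as in the code above is used here: first split off the train rows, then split the remainder into two test sets.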

MODEL BUILDING

library(C50)
model_c50<-C5.0(y ~ ., data=train)

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_c50,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  19496   455
       yes  1042  1599
                                          
               Accuracy : 0.9337          
                 95% CI : (0.9304, 0.9369)
    No Information Rate : 0.9091          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6448          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9493          
            Specificity : 0.7785          
         Pos Pred Value : 0.9772          
         Neg Pred Value : 0.6055          
             Prevalence : 0.9091          
         Detection Rate : 0.8630          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.8639          
                                          
       'Positive' Class : no

library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  22592 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     19496 |       455 |     19951 | 
                                 |     0.863 |     0.020 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |      1042 |      1599 |      2641 | 
                                 |     0.046 |     0.071 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     20538 |      2054 |     22592 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.7913233

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST1 DATA

pred_test1<-predict(model_c50,test1)
confusionMatrix(as.factor(test1$y),as.factor(pred_test1))

Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  9575  401
       yes  722  599
                                          
               Accuracy : 0.9006          
                 95% CI : (0.8949, 0.9061)
    No Information Rate : 0.9115          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.4619          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9299          
            Specificity : 0.5990          
         Pos Pred Value : 0.9598          
         Neg Pred Value : 0.4534          
             Prevalence : 0.9115          
         Detection Rate : 0.8476          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.7644          
                                          
       'Positive' Class : no

library(gmodels)
CrossTable(test1$y, pred_test1,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  11297 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      9575 |       401 |      9976 | 
                                 |     0.848 |     0.035 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       722 |       599 |      1321 | 
                                 |     0.064 |     0.053 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     10297 |      1000 |     11297 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.7066239

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST2 DATA

pred_test2<-predict(model_c50,test2)
confusionMatrix(as.factor(test2$y),as.factor(pred_test2))

Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  9593  382
       yes  711  609
                                          
               Accuracy : 0.9032          
                 95% CI : (0.8976, 0.9086)
    No Information Rate : 0.9123          
    P-Value [Acc > NIR] : 0.9996          
                                          
                  Kappa : 0.4744          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9310          
            Specificity : 0.6145          
         Pos Pred Value : 0.9617          
         Neg Pred Value : 0.4614          
             Prevalence : 0.9123          
         Detection Rate : 0.8493          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.7728          
                                          
       'Positive' Class : no

library(gmodels)
CrossTable(test2$y, pred_test2,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  11295 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      9593 |       382 |      9975 | 
                                 |     0.849 |     0.034 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       711 |       609 |      1320 | 
                                 |     0.063 |     0.054 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     10304 |       991 |     11295 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.7115339

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

RESHUFFLE THE DATA

set.seed(1234)

RESAMPLING THE DATA

library(caret)
split<-createDataPartition(bmarketing$y,p=0.7,list = FALSE)
train<-bmarketing[split,]
test<-bmarketing[-split,]
split<-createDataPartition(test$y,p=0.5,list=FALSE)
test1<-test[split,]
test2<-test[-split,]
dim(train)

[1] 31630    17

dim(test1)

[1] 6777   17

dim(test2)

[1] 6777   17
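
The `dim()` outputs of both reshuffled splits can be cross-checked with a little arithmetic. The sketch below assumes that each partition takes `ceiling(p * n)` rows within each class; that rule is an assumption (it is not stated in this report), and `split_sizes` is a hypothetical helper. The class totals are the sums of the CrossTable row totals reported for the previous split.

```r
# Class totals: train + test1 + test2 row totals of the previous split.
class_totals <- c(no = 19951 + 9976 + 9975, yes = 2641 + 1321 + 1320)

# Assumption: each partition takes ceiling(p * n) rows within each class.
split_sizes <- function(totals, p_train, p_test1 = 0.5) {
  train <- ceiling(p_train * totals)
  rest  <- totals - train
  test1 <- ceiling(p_test1 * rest)
  test2 <- rest - test1
  c(train = sum(train), test1 = sum(test1), test2 = sum(test2))
}

sizes_70 <- split_sizes(class_totals, p_train = 0.7)
sizes_50 <- split_sizes(class_totals, p_train = 0.5)
print(sizes_70)
print(sizes_50)
```

Under this assumption the computed sizes match the row counts printed for the `p=0.7` split above and for the earlier `p=0.5` split.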

MODEL BUILDING

library(C50)
model_c50<-C5.0(y ~ ., data=train)

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_c50,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  27208   724
       yes  1450  2248
                                         
               Accuracy : 0.9313         
                 95% CI : (0.9284, 0.934)
    No Information Rate : 0.906          
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.6362         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.9494         
            Specificity : 0.7564         
         Pos Pred Value : 0.9741         
         Neg Pred Value : 0.6079         
             Prevalence : 0.9060         
         Detection Rate : 0.8602         
   Detection Prevalence : 0.8831         
      Balanced Accuracy : 0.8529         
                                         
       'Positive' Class : no

library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  31630 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |     27208 |       724 |     27932 | 
                                 |     0.860 |     0.023 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |      1450 |      2248 |      3698 | 
                                 |     0.046 |     0.071 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |     28658 |      2972 |     31630 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.790988

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST1 DATA

pred_test1<-predict(model_c50,test1)
confusionMatrix(as.factor(test1$y),as.factor(pred_test1))

Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  5729  256
       yes  400  392
                                          
               Accuracy : 0.9032          
                 95% CI : (0.8959, 0.9101)
    No Information Rate : 0.9044          
    P-Value [Acc > NIR] : 0.6391          
                                          
                  Kappa : 0.4909          
 Mcnemar's Test P-Value : 2.361e-08       
                                          
            Sensitivity : 0.9347          
            Specificity : 0.6049          
         Pos Pred Value : 0.9572          
         Neg Pred Value : 0.4949          
             Prevalence : 0.9044          
         Detection Rate : 0.8454          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.7698          
                                          
       'Positive' Class : no

library(gmodels)
CrossTable(test1$y, pred_test1,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  6777 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      5729 |       256 |      5985 | 
                                 |     0.845 |     0.038 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       400 |       392 |       792 | 
                                 |     0.059 |     0.058 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |      6129 |       648 |      6777 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.7260879

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST2 DATA

pred_test2<-predict(model_c50,test2)
confusionMatrix(as.factor(test2$y),as.factor(pred_test2))

Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  5732  253
       yes  399  393
                                          
               Accuracy : 0.9038          
                 95% CI : (0.8965, 0.9107)
    No Information Rate : 0.9047          
    P-Value [Acc > NIR] : 0.608           
                                          
                  Kappa : 0.4934          
 Mcnemar's Test P-Value : 1.358e-08       
                                          
            Sensitivity : 0.9349          
            Specificity : 0.6084          
         Pos Pred Value : 0.9577          
         Neg Pred Value : 0.4962          
             Prevalence : 0.9047          
         Detection Rate : 0.8458          
   Detection Prevalence : 0.8831          
      Balanced Accuracy : 0.7716          
                                          
       'Positive' Class : no

library(gmodels)
CrossTable(test2$y, pred_test2,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  6777 

 
                                 | predicted term deposit subscription 
actual term deposit subscription |        no |       yes | Row Total | 
---------------------------------|-----------|-----------|-----------|
                              no |      5732 |       253 |      5985 | 
                                 |     0.846 |     0.037 |           | 
---------------------------------|-----------|-----------|-----------|
                             yes |       399 |       393 |       792 | 
                                 |     0.059 |     0.058 |           | 
---------------------------------|-----------|-----------|-----------|
                    Column Total |      6131 |       646 |      6777 | 
---------------------------------|-----------|-----------|-----------|

library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.7269699

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Let's summarize the performance of the C50 decision tree model on the different combinations of shuffled data.

FINAL SUMMARY

FINAL SUMMARY OF C50 DECISION TREE
PARAMETERS	TRAIN_60	TEST1_20	TEST2_20	TRAIN_50	TEST1_25	TEST2_25	TRAIN_70	TEST1_15	TEST2_15
ACCURACY	93.0	90.5	90.0	93.3	90.0	90	93.0	90.0	90.4
KAPPA	62.6	48.0	45.0	64.5	46.0	47	63.6	49.0	49.0
SENSITIVITY	94.6	93.0	92.8	95.0	93.0	93	95.0	93.5	93.5
SPECIFICITY	76.9	63.0	60.0	73.8	60.0	61	75.6	60.5	60.8
BALANCED_ACCURACY	85.7	78.0	76.4	86.0	76.0	77	85.0	77.0	77.0
AREA_UNDER_CURVE	78.0	71.0	70.0	79.0	70.7	71	79.0	72.6	72.7

CONCLUSION

From the above summary we can infer that the model performed consistently across the different splits of the marketing dataset, with little sign of overfitting (low variance between train and test results) and an average test accuracy of about 90%, that is, an error rate of about 10%. It should therefore predict a customer's subscription to the bank term deposit efficiently. We can conclude that the C50 decision tree model has good potential to predict well on future data too; it is therefore the best of the models compared and is ready to be deployed.
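
The claim of consistent performance can be checked by summarising the six test-set columns of the final summary table. A minimal sketch in base R, with the values copied from that table (the column names in the code are illustrative labels for the original split and the two reshuffled splits):

```r
# Test-set columns of the final summary table (values in percent).
# split1 = original split, split2 = set.seed(123), split3 = set.seed(1234).
final_tests <- data.frame(
  metric = c("ACCURACY", "KAPPA", "SENSITIVITY", "SPECIFICITY",
             "BALANCED_ACCURACY", "AREA_UNDER_CURVE"),
  split1_test1 = c(90.5, 48.0, 93.0, 63.0, 78.0, 71.0),
  split1_test2 = c(90.0, 45.0, 92.8, 60.0, 76.4, 70.0),
  split2_test1 = c(90.0, 46.0, 93.0, 60.0, 76.0, 70.7),
  split2_test2 = c(90.0, 47.0, 93.0, 61.0, 77.0, 71.0),
  split3_test1 = c(90.0, 49.0, 93.5, 60.5, 77.0, 72.6),
  split3_test2 = c(90.4, 49.0, 93.5, 60.8, 77.0, 72.7)
)

values    <- as.matrix(final_tests[, -1])
raw_means <- rowMeans(values)

spread <- data.frame(
  metric = final_tests$metric,
  mean   = round(raw_means, 2),
  min    = apply(values, 1, min),
  max    = apply(values, 1, max)
)
print(spread)
```

A narrow min-max range for each metric across the six test sets is what the conclusion describes as consistent performance.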

DATASET WITH PREDICTED VALUES



FINAL DATASET WITH OBSERVED AND EXPECTED SUBSCRIPTION BY THE CUSTOMER (displaying only first 15 observations)
age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome	observed_y	expected_y
58	management	married	tertiary	no	2143	yes	no	unknown	5	may	261	1	-1	unknown	no	no
44	technician	single	secondary	no	29	yes	no	unknown	5	may	151	1	-1	unknown	no	no
33	entrepreneur	married	secondary	no	2	yes	yes	unknown	5	may	76	1	-1	unknown	no	no
47	blue-collar	married	unknown	no	1506	yes	no	unknown	5	may	92	1	-1	unknown	no	no
33	unknown	single	unknown	no	1	no	no	unknown	5	may	198	1	-1	unknown	no	no
35	management	married	tertiary	no	231	yes	no	unknown	5	may	139	1	-1	unknown	no	no
28	management	single	tertiary	no	447	yes	yes	unknown	5	may	217	1	-1	unknown	no	no
42	entrepreneur	divorced	tertiary	yes	2	yes	no	unknown	5	may	380	1	-1	unknown	no	no
58	retired	married	primary	no	121	yes	no	unknown	5	may	50	1	-1	unknown	no	no
43	technician	single	secondary	no	593	yes	no	unknown	5	may	55	1	-1	unknown	no	no
41	admin.	divorced	secondary	no	270	yes	no	unknown	5	may	222	1	-1	unknown	no	no
29	admin.	single	secondary	no	390	yes	no	unknown	5	may	137	1	-1	unknown	no	no
53	technician	married	secondary	no	6	yes	no	unknown	5	may	517	1	-1	unknown	no	no
58	technician	married	unknown	no	71	yes	no	unknown	5	may	71	1	-1	unknown	no	no
57	services	married	secondary	no	162	yes	no	unknown	5	may	174	1	-1	unknown	no	no
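
The code that produced the table above is not shown in the report. As a hedged sketch of one way such a table could be assembled in base R, the outcome column is renamed and the model's predictions are appended; the data frame `bank` and the vector `predicted` below are small stand-ins invented for illustration, not the report's objects.

```r
# Tiny stand-in data frame; in the report this would be the full data set.
bank <- data.frame(
  age = c(58, 44, 33),
  job = c("management", "technician", "entrepreneur"),
  y   = factor(c("no", "no", "no"), levels = c("no", "yes"))
)
# Stand-in predictions; in the report these would come from the fitted model.
predicted <- factor(c("no", "no", "no"), levels = c("no", "yes"))

final <- bank
names(final)[names(final) == "y"] <- "observed_y"  # keep the recorded outcome
final$expected_y <- predicted                      # append the model's prediction
print(head(final, 15))
```

Keeping both columns side by side makes it easy to filter the rows where the observed and expected subscription differ.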

BANK MARKETING

shekar

19 August 2018
