Read the dataset into R

This data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed or not.
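The read step itself is not shown here. A minimal sketch of it follows, assuming the data comes as a semicolon-separated text file with a header row. The temporary file, its four columns and the name bank_demo are hypothetical stand-ins so that the example runs on its own; with the real file, the same read.csv() call would be pointed at its path and assigned to bank.

```r
# Hypothetical stand-in for the real file: a few rows in a
# semicolon-separated layout, written to a temporary path.
path <- tempfile(fileext = ".csv")
writeLines(c(
  '"age";"job";"contact";"y"',
  '58;"management";"unknown";"no"',
  '44;"technician";"cellular";"yes"',
  '33;"entrepreneur";"telephone";"no"'
), path)

# sep = ";" matches the delimiter. stringsAsFactors = TRUE stores text
# columns as factors, so y becomes a factor with levels "no" and "yes".
bank_demo <- read.csv(path, sep = ";", stringsAsFactors = TRUE)
str(bank_demo)
```

Having y as a factor matters later, because svm() chooses classification rather than regression when the response is a factor.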

Splitting the data

split<-sample(nrow(bank),nrow(bank)*0.8)
train<-bank[split,]
test<-bank[-split,]
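The split above draws a different random sample on every run, so the results further down change from run to run. A minimal sketch of a repeatable 80/20 split follows. The seed value and the small bank_demo data frame are assumptions made only for illustration; on the real data the same three lines would be applied to bank.

```r
set.seed(123)  # arbitrary seed, only to make the split repeatable

# Hypothetical stand-in for bank: 1000 rows, roughly 12% "yes"
bank_demo <- data.frame(
  age = sample(18:90, 1000, replace = TRUE),
  y   = factor(sample(c("no", "yes"), 1000, replace = TRUE,
                      prob = c(0.88, 0.12)))
)

# 80% of the row numbers go to training, the rest to testing
idx        <- sample(nrow(bank_demo), floor(nrow(bank_demo) * 0.8))
train_demo <- bank_demo[idx, ]
test_demo  <- bank_demo[-idx, ]

# The class shares should be similar in both parts
prop.table(table(train_demo$y))
prop.table(table(test_demo$y))
```

Comparing the class shares is a quick check that the rare "yes" class is represented in both parts.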

Exploratory Data Analysis

This step draws insights from the dataset.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
x<-filter(bank,y=="yes")
library(ggplot2)
ggplot(bank,aes(job))+geom_bar(aes(fill=y))

Among the job categories, management and technician clients account for the most term deposit subscriptions.

ggplot(x,aes(job))+geom_bar(aes(fill=contact))

table(bank$contact,bank$y)
##            
##                no   yes
##   cellular  24916  4369
##   telephone  2516   390
##   unknown   12490   530
ggplot(x,aes(previous))+geom_bar(aes(fill=y))

Among clients who subscribed, most had no contacts before this campaign; clients with previous contacts account for fewer subscriptions than clients with none.

table(bank$poutcome,bank$y)
##          
##              no   yes
##   failure  4283   618
##   other    1533   307
##   success   533   978
##   unknown 33573  3386

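Among the clients who subscribed, those whose previous campaign outcome is unknown account for the most subscriptions (3386). This category is also by far the largest, and only about 9% of its clients subscribed (3386 of 36959), while the subscription rate is highest for the success category (978 of 1511, about 65%).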
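Raw counts favour the largest category, so it can help to look at the share of subscribers within each category as well. The sketch below rebuilds the table from the counts printed above and normalises each row. The names poutcome_tab and rates are chosen here for illustration; on the real data the same result should come from prop.table(table(bank$poutcome, bank$y), 1).

```r
# Counts copied from table(bank$poutcome, bank$y) above
poutcome_tab <- matrix(
  c( 4283,  618,
     1533,  307,
      533,  978,
    33573, 3386),
  ncol = 2, byrow = TRUE,
  dimnames = list(poutcome = c("failure", "other", "success", "unknown"),
                  y = c("no", "yes"))
)

# margin = 1 normalises each row, giving the share of "no" and "yes"
# within each previous-outcome category
rates <- prop.table(poutcome_tab, margin = 1)
round(rates, 3)
```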

Model Building

As the outcome variable “y” has two classes, “yes” and “no”, we can start with a classification model, namely an SVM.

SVM model

library(e1071)
## Warning: package 'e1071' was built under R version 3.5.1
model<-svm(y ~ .,data = train)
summary(model)
## 
## Call:
## svm(formula = y ~ ., data = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.02325581 
## 
## Number of Support Vectors:  8012
## 
##  ( 4094 3918 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  no yes
library(caret)
## Warning: package 'caret' was built under R version 3.5.1
## Loading required package: lattice
pred1<-predict(model,test)
confusionMatrix(as.factor(test$y),as.factor(pred1))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7865  158
##        yes  722  298
##                                           
##                Accuracy : 0.9027          
##                  95% CI : (0.8964, 0.9087)
##     No Information Rate : 0.9496          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3591          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9159          
##             Specificity : 0.6535          
##          Pos Pred Value : 0.9803          
##          Neg Pred Value : 0.2922          
##              Prevalence : 0.9496          
##          Detection Rate : 0.8697          
##    Detection Prevalence : 0.8872          
##       Balanced Accuracy : 0.7847          
##                                           
##        'Positive' Class : no              
## 

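The test accuracy of this model is about 90% (0.9027), not 100%. In the table above the rows are the actual classes from test$y and the columns are the model's predictions, because confusionMatrix() takes the predictions as its first argument and the true labels as its second, the reverse of the call used here. Read that way, the model identifies only 298 of the 1020 actual subscribers (about 29%), so accuracy alone overstates how well it finds the “yes” class.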
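The per-class figures can be recomputed by hand from the four counts in the SVM table. The sketch below assumes the rows of that table are the actual classes (test$y is the first argument of the call above) and the columns are the predictions. The names cm, accuracy, recall_yes and precision_yes are introduced here for illustration.

```r
# Counts from the SVM output above: actual classes in rows,
# predicted classes in columns
cm <- matrix(c(7865, 158,
                722, 298),
             ncol = 2, byrow = TRUE,
             dimnames = list(actual = c("no", "yes"),
                             predicted = c("no", "yes")))

# Share of all test rows that were predicted correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual subscribers that the model finds
recall_yes <- cm["yes", "yes"] / sum(cm["yes", ])

# Share of predicted subscribers that really subscribed
precision_yes <- cm["yes", "yes"] / sum(cm[, "yes"])

round(c(accuracy = accuracy, recall_yes = recall_yes,
        precision_yes = precision_yes), 4)
```

These are the same values that the output above lists as Accuracy (0.9027), Neg Pred Value (0.2922) and Specificity (0.6535), which is what one would expect when the two arguments are swapped and "no" is the positive class.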

We can try to improve on this model with a decision tree, namely a CART model.

CART model

library(rpart)
## Warning: package 'rpart' was built under R version 3.5.1
model1<-rpart(y ~ .,data=train)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.5.1
rpart.plot(model1)

Prediction

pred<-predict(model1,test,type = "class")
confusionMatrix(as.factor(test$y),as.factor(pred))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7800  223
##        yes  647  373
##                                           
##                Accuracy : 0.9038          
##                  95% CI : (0.8975, 0.9098)
##     No Information Rate : 0.9341          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4128          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9234          
##             Specificity : 0.6258          
##          Pos Pred Value : 0.9722          
##          Neg Pred Value : 0.3657          
##              Prevalence : 0.9341          
##          Detection Rate : 0.8625          
##    Detection Prevalence : 0.8872          
##       Balanced Accuracy : 0.7746          
##                                           
##        'Positive' Class : no              
## 

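Accuracy looks good at about 90% (0.9038). The high sensitivity (0.9234) describes the majority class “no”, which caret treats as the positive class. For the “yes” class the picture is weaker: reading the rows of the table as the actual classes (test$y is passed as the first argument), the tree identifies 373 of the 1020 actual subscribers, about 37%.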

Using advanced models

Bagging

library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.1
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
model2<-train(y ~ .,data=train,method="rf",ntree =20)

Prediction

pred2<-predict(model2,test)
confusionMatrix(as.factor(test$y),as.factor(pred2))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7702  321
##        yes  532  488
##                                           
##                Accuracy : 0.9057          
##                  95% CI : (0.8995, 0.9116)
##     No Information Rate : 0.9105          
##     P-Value [Acc > NIR] : 0.9486          
##                                           
##                   Kappa : 0.4819          
##  Mcnemar's Test P-Value : 6.467e-13       
##                                           
##             Sensitivity : 0.9354          
##             Specificity : 0.6032          
##          Pos Pred Value : 0.9600          
##          Neg Pred Value : 0.4784          
##              Prevalence : 0.9105          
##          Detection Rate : 0.8517          
##    Detection Prevalence : 0.8872          
##       Balanced Accuracy : 0.7693          
##                                           
##        'Positive' Class : no              
## 

Boosting

set.seed(345)
split<-sample(nrow(bank),nrow(bank)*0.8)
train_ada<-bank[split,]
test_ada<-bank[-split,]
library(ada)
## Warning: package 'ada' was built under R version 3.5.1
model3<-ada(y ~ .,data = train_ada,loss="exponential",type="discrete",iter=50 )
summary(model3)
## Call:
## ada(y ~ ., data = train_ada, loss = "exponential", type = "discrete", 
##     iter = 50)
## 
## Loss: exponential Method: discrete   Iteration: 50 
## 
## Training Results
## 
## Accuracy: 0.907 Kappa: 0.456

Prediction

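# Score the boosted model (model3). The original call passed model2, the
# random forest, and the output printed below comes from that original call.
pred3<-predict(model3,test_ada)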
confusionMatrix(as.factor(test_ada$y),as.factor(pred3))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7930   71
##        yes  108  934
##                                          
##                Accuracy : 0.9802         
##                  95% CI : (0.9771, 0.983)
##     No Information Rate : 0.8889         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9014         
##  Mcnemar's Test P-Value : 0.007129       
##                                          
##             Sensitivity : 0.9866         
##             Specificity : 0.9294         
##          Pos Pred Value : 0.9911         
##          Neg Pred Value : 0.8964         
##              Prevalence : 0.8889         
##          Detection Rate : 0.8769         
##    Detection Prevalence : 0.8848         
##       Balanced Accuracy : 0.9580         
##                                          
##        'Positive' Class : no             
## 

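At first sight this looks great: about 98 out of 100 test predictions are correct (accuracy 0.9802). However, the output above was produced by predict(model2, test_ada), that is, by the random forest rather than the boosted model3. Because test_ada comes from a new random split, roughly 80% of its rows are expected to be in the data model2 was trained on, so this accuracy is inflated and says nothing about boosting.

The only figure reported here for the boosted model itself is its training accuracy of 0.907 (kappa 0.456); its test performance has to be measured with predict(model3, test_ada).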
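One thing worth checking whenever the data is split a second time: test_ada is drawn independently of the first split, so a model fitted on the first train set (such as model2) has already seen many of its rows. The sketch below estimates how many. The row count 45211 is the total of the contact table above; the seed for the first split is an assumption, since that split was not seeded.

```r
n <- 45211  # rows in bank, the total of the contact table above

set.seed(1)    # assumed seed standing in for the unseeded first split
train_idx <- sample(n, floor(n * 0.8))

set.seed(345)  # the seed used for the boosting split
split_ada    <- sample(n, floor(n * 0.8))
test_ada_idx <- setdiff(seq_len(n), split_ada)

# Share of the second test set that the first split used for training
overlap <- mean(test_ada_idx %in% train_idx)
overlap
```

An overlap near 0.8 means that most of the second test set is not unseen data for the first model, so scoring that model on it does not give an honest test accuracy.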