This data relates to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed.
# 80/20 random split of the bank data into training and test sets
split<-sample(nrow(bank),nrow(bank)*0.8)
train<-bank[split,]
test<-bank[-split,]
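A quick sanity check (a small sketch; it assumes the outcome column is named y, as used later) to confirm that the split roughly preserves the class balance:
# Share of "yes"/"no" in each partition; the proportions should be close
prop.table(table(train$y))
prop.table(table(test$y))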
Drawing insights from the dataset
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Clients who subscribed to the term deposit
x<-filter(bank,y=="yes")
library(ggplot2)
ggplot(bank,aes(job))+geom_bar(aes(fill=y))
Management and technician are the most common job categories among clients who subscribed to a term deposit.
ggplot(x,aes(job))+geom_bar(aes(fill=contact))
table(bank$contact,bank$y)
##
## no yes
## cellular 24916 4369
## telephone 2516 390
## unknown 12490 530
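The counts above mix channels of very different sizes; a row-wise proportion table (a small sketch reusing the same table) makes the subscription rate per contact channel easier to compare:
# Subscription rate within each contact channel (rows sum to 1)
round(prop.table(table(bank$contact, bank$y), margin = 1), 3)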
ggplot(x,aes(previous))+geom_bar(aes(fill=y))
Clients who had previous contacts before this campaign account for fewer subscriptions than clients who had never been contacted before.
table(bank$poutcome,bank$y)
##
## no yes
## failure 4283 618
## other 1533 307
## success 533 978
## unknown 33573 3386
Among the clients who subscribed, the "unknown" previous-outcome category accounts for the largest number of subscriptions, although the "success" category has by far the highest subscription rate (978 of 1511).
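To back up that reading, a short dplyr sketch computes the subscription rate within each poutcome level:
# Subscription rate and group size per previous-campaign outcome
bank %>%
  group_by(poutcome) %>%
  summarise(rate = mean(y == "yes"), n = n())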
As the outcome variable "y" has two classes, "yes" and "no", we can start with classification models, beginning with an SVM.
SVM model
library(e1071)
## Warning: package 'e1071' was built under R version 3.5.1
model<-svm(y ~ .,data = train)
summary(model)
##
## Call:
## svm(formula = y ~ ., data = train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.02325581
##
## Number of Support Vectors: 8012
##
## ( 4094 3918 )
##
##
## Number of Classes: 2
##
## Levels:
## no yes
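The model above uses the default cost and gamma shown in the summary. A coarse grid search with e1071's tune.svm could be tried as well (a sketch only; the grid values are illustrative and the search can be slow on this many rows):
# Illustrative grid search over cost and gamma
tuned <- tune.svm(y ~ ., data = train, cost = c(0.1, 1, 10), gamma = c(0.01, 0.05, 0.1))
summary(tuned)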
library(caret)
## Warning: package 'caret' was built under R version 3.5.1
## Loading required package: lattice
pred1<-predict(model,test)
confusionMatrix(as.factor(test$y),as.factor(pred1))
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7865 158
## yes 722 298
##
## Accuracy : 0.9027
## 95% CI : (0.8964, 0.9087)
## No Information Rate : 0.9496
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3591
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9159
## Specificity : 0.6535
## Pos Pred Value : 0.9803
## Neg Pred Value : 0.2922
## Prevalence : 0.9496
## Detection Rate : 0.8697
## Detection Prevalence : 0.8872
## Balanced Accuracy : 0.7847
##
## 'Positive' Class : no
##
Accuracy is about 90%, but this is below the no-information rate (~95%), and performance on the "yes" class is weak (balanced accuracy ~0.78), so this model may not generalise well for identifying subscribers.
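One common way to improve recall on the minority "yes" class is to penalise its misclassification more heavily via class weights (a sketch, not run here; the model name and weights are illustrative):
# Give the "yes" class a higher misclassification cost (weights are illustrative)
model_w <- svm(y ~ ., data = train, class.weights = c(no = 1, yes = 5))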
We can try to improve on this by using decision-tree models, starting with a CART model.
CART model
library(rpart)
## Warning: package 'rpart' was built under R version 3.5.1
model1<-rpart(y ~ .,data=train)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.5.1
rpart.plot(model1)
pred<-predict(model1,test,type = "class")
confusionMatrix(as.factor(test$y),as.factor(pred))
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7800 223
## yes 647 373
##
## Accuracy : 0.9038
## 95% CI : (0.8975, 0.9098)
## No Information Rate : 0.9341
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4128
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9234
## Specificity : 0.6258
## Pos Pred Value : 0.9722
## Neg Pred Value : 0.3657
## Prevalence : 0.9341
## Detection Rate : 0.8625
## Detection Prevalence : 0.8872
## Balanced Accuracy : 0.7746
##
## 'Positive' Class : no
##
Accuracy is close to 90% and sensitivity (the true positive rate for the "no" class) is also good, though the "yes" class remains harder to predict (specificity 0.63).
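rpart also stores a complexity-parameter table that can be used to check whether pruning the tree would help (a short sketch using rpart's own cptable; model1_pruned is an illustrative name):
# Pick the cp with the lowest cross-validated error and prune to it
printcp(model1)
best_cp <- model1$cptable[which.min(model1$cptable[, "xerror"]), "CP"]
model1_pruned <- prune(model1, cp = best_cp)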
Bagging (random forest)
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.1
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
model2<-train(y ~ .,data=train,method="rf",ntree =20)
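By default caret::train tunes mtry with bootstrap resampling; cross-validation can be requested instead (a sketch; the fold count and object names are illustrative):
# 5-fold cross-validation instead of the default bootstrap resampling
ctrl <- trainControl(method = "cv", number = 5)
model2_cv <- train(y ~ ., data = train, method = "rf", ntree = 20, trControl = ctrl)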
Prediction
pred2<-predict(model2,test)
confusionMatrix(as.factor(test$y),as.factor(pred2))
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7702 321
## yes 532 488
##
## Accuracy : 0.9057
## 95% CI : (0.8995, 0.9116)
## No Information Rate : 0.9105
## P-Value [Acc > NIR] : 0.9486
##
## Kappa : 0.4819
## Mcnemar's Test P-Value : 6.467e-13
##
## Sensitivity : 0.9354
## Specificity : 0.6032
## Pos Pred Value : 0.9600
## Neg Pred Value : 0.4784
## Prevalence : 0.9105
## Detection Rate : 0.8517
## Detection Prevalence : 0.8872
## Balanced Accuracy : 0.7693
##
## 'Positive' Class : no
##
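The bagged forest gives the best balance so far between the two classes. To see which predictors drive it, caret's variable importance can be inspected (a short sketch):
# Variable importance from the fitted random forest
varImp(model2)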
Boosting
set.seed(345)
split<-sample(nrow(bank),nrow(bank)*0.8)
train_ada<-bank[split,]
test_ada<-bank[-split,]
library(ada)
## Warning: package 'ada' was built under R version 3.5.1
model3<-ada(y ~ .,data = train_ada,loss="exponential",type="discrete",iter=50 )
summary(model3)
## Call:
## ada(y ~ ., data = train_ada, loss = "exponential", type = "discrete",
## iter = 50)
##
## Loss: exponential Method: discrete Iteration: 50
##
## Training Results
##
## Accuracy: 0.907 Kappa: 0.456
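The fitted ada object can also be plotted to show the training error across boosting iterations, which helps judge whether 50 iterations are enough (a sketch; output not shown here):
# Training error as a function of boosting iteration
plot(model3)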
Prediction
pred3<-predict(model3,test_ada)
confusionMatrix(as.factor(test_ada$y),as.factor(pred3))
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7930 71
## yes 108 934
##
## Accuracy : 0.9802
## 95% CI : (0.9771, 0.983)
## No Information Rate : 0.8889
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9014
## Mcnemar's Test P-Value : 0.007129
##
## Sensitivity : 0.9866
## Specificity : 0.9294
## Pos Pred Value : 0.9911
## Neg Pred Value : 0.8964
## Prevalence : 0.8889
## Detection Rate : 0.8769
## Detection Prevalence : 0.8848
## Balanced Accuracy : 0.9580
##
## 'Positive' Class : no
##
This looks very good: the model correctly predicts whether a contacted client subscribes to the term deposit roughly 98 times out of 100 on the held-out test set.
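For a quick side-by-side comparison on the original test split, the earlier predictions can be scored together (a small sketch reusing objects created above; the boosted model is excluded because it was evaluated on a different split):
# Test-set accuracy of the SVM, CART and random forest models
sapply(list(svm = pred1, cart = pred, rf = pred2), function(p) mean(p == test$y))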