ARJUN SIROHI- MS Business Analytics
2022-07-22
About the Data
This data comes from the marketing campaigns of a Portuguese banking institution, which were conducted by phone. The bank’s customer service center called customers to promote its term deposit product while gathering basic information about them.
The dataset has 45211 rows and 15 columns. The columns used in the models were chosen for their variable importance and for having low enough correlation with one another that it does not distort the results.
Methodology
The goal of this analysis is to predict whether or not a client of this Portuguese bank will sign up for a term deposit (yes/no).
Because this is a binary classification problem, the models must be suited to classification. This analysis therefore uses two models: logistic regression and Adaptive-Boosting.
ML-Classification Models
Logistic regression is a statistical analysis method to predict a binary outcome, such as yes or no, based on prior observations of a data set. A logistic regression model predicts a dependent data variable by analyzing the relationship between one or more existing independent variables.
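As a rough illustration of the idea, the sketch below fits a logistic regression to simulated data in base R. It does not use the bank dataset: the predictor `balance`, the coefficients, and the 0.5 cutoff are all assumptions made for the example.

```r
# Minimal sketch on simulated data (NOT the bank dataset).
set.seed(1)
n <- 500
balance <- rnorm(n)                     # hypothetical standardized predictor
p_true  <- plogis(-2 + 1.2 * balance)   # inverse logit: 1 / (1 + exp(-z))
y       <- rbinom(n, size = 1, prob = p_true)

# Fit the model: the log-odds of y = 1 are linear in the predictor
fit <- glm(y ~ balance, family = binomial)

# Predicted probabilities are the inverse logit of the linear predictor
eta   <- coef(fit)[1] + coef(fit)[2] * balance
p_hat <- 1 / (1 + exp(-eta))

# Classify as "yes" when the predicted probability exceeds 0.5 (assumed cutoff)
pred <- ifelse(p_hat > 0.5, "yes", "no")
table(pred)
```

The manual `p_hat` should agree with `fitted(fit)`, which is one way to check that the model is just a linear predictor passed through the inverse logit.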
Adaptive-Boosting (Ada-boost) is one of the first boosting algorithms to be adopted in practice. It combines multiple “weak classifiers” into a single “strong classifier”. The weak learners in Ada-boost are typically decision trees with a single split, called decision stumps. The algorithm works by putting more weight on instances that are difficult to classify and less on those already handled well. It can be used for both classification and regression problems.
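The reweighting step can be sketched for a single boosting round as follows. This is a toy example of the standard discrete AdaBoost update with labels coded as -1/+1; the labels and the stump's predictions are made up, and this is not the internal code of any particular package.

```r
# One round of discrete AdaBoost on a toy example (labels coded -1 / +1).
y    <- c( 1,  1, -1, -1,  1, -1)   # true labels (toy values)
pred <- c( 1, -1, -1, -1,  1,  1)   # output of a hypothetical decision stump

w <- rep(1 / length(y), length(y))  # start with uniform weights

miss  <- pred != y                     # observations the stump got wrong
err   <- sum(w[miss])                  # weighted error of the stump
alpha <- 0.5 * log((1 - err) / err)    # the stump's vote in the final ensemble

# Raise the weight of misclassified points, lower the rest, then renormalize
w_new <- w * exp(ifelse(miss, alpha, -alpha))
w_new <- w_new / sum(w_new)

round(w_new, 3)
```

After the update, the misclassified observations together carry half of the total weight, so the next weak learner is pushed toward the cases the current one handled badly.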
Exploratory Data Analysis
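library(ggplot2)
library(ggthemes)  # provides theme_pander()
# Count customers per job and plot the counts as a horizontal bar chart
job_tab <- data.frame(table(bankdata$job))
colnames(job_tab) <- c("Job", "count")
ggplot(data = job_tab, aes(x = count, y = reorder(Job, count), fill = Job)) +
  geom_bar(stat = "identity") +
  labs(x = NULL,
       y = NULL,
       title = "Customers Job Distribution") +
  theme_pander()
Response of the recent campaign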
| Response | Percentage(%) |
|---|---|
| No | 88.30 |
| Yes | 11.69 |
Response from previous campaign
# Cross-tabulate the previous campaign outcome against the current response
p_out_tab <- data.frame(table(bankdata$poutcome, bankdata$y))
colnames(p_out_tab) <- c("PreviousOutcome", "Response", "Count")
ggplot(p_out_tab, aes(x = PreviousOutcome, y = Count, fill = Response)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Contact-Response Outcome") +
  theme_pander() +
  scale_fill_manual(values = c("darkorange", "dodgerblue4"))
Response from Each occupation
# Cross-tabulate job against the campaign response
job_y_tab <- data.frame(table(bankdata$job, bankdata$y))
colnames(job_y_tab) <- c("job", "Response", "count")
ggplot(data = job_y_tab, aes(x = count, y = reorder(job, count), fill = Response)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Number of customers",
       y = NULL,
       title = "Campaign result by Job distribution") +
  theme_pander() +
  scale_fill_manual(values = c("yellowgreen", "gray25"))
Customer jobs and Median balance distribution
ggplot(bankdata, aes(x = balance, y = job)) +
  geom_boxplot(fill = "yellow2", color = "red3") +
  labs(y = NULL) +
  theme_pander()
Customer Loan Status
# Counts are on the y axis here, so the customer-count label belongs on y
ggplot(data = bankdata, aes(x = loan, fill = loan)) +
  geom_bar(position = "dodge") +
  labs(y = "Number of customers") +
  theme_pander() +
  scale_fill_manual(values = c("orange", "gray25"))
Loan Status & Customer response
# Cross-tabulate loan status against the campaign response
loan_tab <- data.frame(table(bankdata$loan, bankdata$y))
colnames(loan_tab) <- c("Loan", "Response", "Count")
ggplot(data = loan_tab, aes(x = Loan, y = Count, fill = Response)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(y = "Number of customers") +
  theme_pander() +
  scale_fill_manual(values = c("orange", "red3"))
Logistic Regression Performance
library(caret)
# Create an 80/20 training/test split, stratified on the response
set.seed(22)
inTrain <- createDataPartition(bankdata$y, p = 0.8, list = FALSE)
bank_train <- bankdata[inTrain, ]
bank_test <- bankdata[-inTrain, ]
# Use 10-fold cross-validation when fitting with caret::train()
ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE)
# Logistic regression
log_fit <- train(y ~ ., data = bank_train,
                 method = "glm",
                 trControl = ctrl)
# Predictions on the held-out test set
pred_log <- predict(log_fit, newdata = bank_test)
confusionMatrix(pred_log, bank_test$y)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   no  yes
##        no  7805  698
##        yes  179  359
##
## Accuracy : 0.903
## 95% CI : (0.8967, 0.909)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 8.351e-10
##
## Kappa : 0.4031
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9776
## Specificity : 0.3396
## Pos Pred Value : 0.9179
## Neg Pred Value : 0.6673
## Prevalence : 0.8831
## Detection Rate : 0.8633
## Detection Prevalence : 0.9405
## Balanced Accuracy : 0.6586
##
## 'Positive' Class : no
##
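The headline statistics in this output can be reproduced by hand from the four counts in the table. The sketch below is a base R recomputation from the printed numbers, not a rerun of the model; the variable names are chosen only for this example.

```r
# Counts copied from the logistic regression confusion matrix
# (rows = prediction, columns = reference).
cm <- matrix(c(7805, 179, 698, 359), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["no", "no"]   / sum(cm[, "no"])   # positive class is "no"
specificity <- cm["yes", "yes"] / sum(cm[, "yes"])
nir         <- max(colSums(cm)) / sum(cm)           # no-information rate
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, nir = nir, balanced = balanced), 4)
```

Because the positive class is "no", the specificity of about 0.34 is the share of actual subscribers ("yes") that the model identifies, which is the figure that matters most for targeting the campaign.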
Important Variables
Adaptive-Boosting Model
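library(ada)
# Fit discrete AdaBoost with exponential loss and 50 boosting iterations
model_ada <- ada(y ~ ., data = bank_train, loss = "exponential", type = "discrete", iter = 50)
# Variable importance plot
varplot(model_ada)
Adaptive-Boosting Performance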
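# Predictions on the held-out test set
pred_ada <- predict(model_ada, bank_test)
# Note: the true labels are passed first here, so the rows below are the true
# classes and the columns the predicted classes (the reverse of the logistic
# regression table). The figures comparable to the logistic model's Sensitivity
# and Specificity are the ones printed as Pos Pred Value and Neg Pred Value.
confusionMatrix(as.factor(bank_test$y), as.factor(pred_ada))
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   no  yes
##        no  7763  221
##        yes  700  357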
##
## Accuracy : 0.8981
## 95% CI : (0.8917, 0.9043)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3859
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9173
## Specificity : 0.6176
## Pos Pred Value : 0.9723
## Neg Pred Value : 0.3377
## Prevalence : 0.9361
## Detection Rate : 0.8586
## Detection Prevalence : 0.8831
## Balanced Accuracy : 0.7675
##
## 'Positive' Class : no
##
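The call that produced this output passes the true labels as the first argument and the predictions as the second, so the printed table is transposed relative to the logistic regression one. The sketch below transposes the printed counts back and recomputes the class-wise rates in base R, so the two models can be compared on the same footing. It is a recomputation from the printed numbers, not a rerun of the model, and the variable names are chosen only for this example.

```r
# Counts copied from the Adaptive-Boosting output. Because the true labels
# were passed first, rows are the TRUE class and columns the PREDICTED class.
cm_printed <- matrix(c(7763, 700, 221, 357), nrow = 2,
                     dimnames = list(Truth     = c("no", "yes"),
                                     Predicted = c("no", "yes")))

# Transpose so that rows = prediction and columns = truth
cm <- t(cm_printed)

accuracy    <- sum(diag(cm)) / sum(cm)              # unaffected by transposing
sensitivity <- cm["no", "no"]   / sum(cm[, "no"])   # positive class is "no"
specificity <- cm["yes", "yes"] / sum(cm[, "yes"])
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
```

Accuracy and Kappa do not depend on the orientation, but sensitivity, specificity, and balanced accuracy do. In this orientation the sensitivity and specificity match the values printed above as Pos Pred Value and Neg Pred Value.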
Performance Analysis & Conclusion
Customers with higher education and more exposure to financial resources are likely to have viewed the investment offered in this campaign more favorably.
Customers with a higher median balance are more inclined to invest in this campaign.
Customers with loans and mortgages are less likely to invest their money.
Finally, the logistic regression model is recommended for predicting whether or not the Portuguese bank’s clients will subscribe to a term deposit. Its accuracy (0.903) was higher than that of the Adaptive-Boosting model (0.8981), a margin of about half a percentage point.