Executive Summary
A logistic regression model was used to predict whether a customer would subscribe to a long-term bank deposit, recorded as either a success or a failure. Since this is a dichotomous classification problem, logistic regression was our first choice for this case study, as the model also provides information about the statistical significance of the predictors and their relationship to the outcome variable. In our analysis, we fit two logit models, one chosen by stepwise selection under the Akaike Information Criterion (AIC) and one under the Bayesian Information Criterion (BIC), and compared their accuracy and sensitivity to find the best model. The AIC model achieved 75.6% accuracy and 60.8% sensitivity on the testing data. Although the BIC model had a slightly lower accuracy of 73.4%, it had a higher sensitivity of 62% and contained fewer variables, making it our final choice for this classification task.
The Problem
Bank marketing campaigns are a popular way to engage with customers, advertise a new product or service, and grow the business. This case study uses direct marketing, a promotional method in which a targeted group of customers is contacted directly in order to increase sales of a product of interest. Some examples of direct marketing include text messaging, telemarketing, and email marketing. An important feature of this strategy is that the effectiveness of a campaign can be measured by monitoring customers' feedback and responses.
Using direct marketing campaign data from a Portuguese financial institution, our goal is to predict the likelihood of a customer subscribing to a long-term deposit. In this campaign, customers were contacted through phone calls, both inbound and outbound. Often, more than one contact with the same client was required to determine whether the client would subscribe to the deposit.
The value of this study lies in identifying and targeting customers who are likely to subscribe to long-term deposits using the predicted probabilities from our logistic regression model. This can help the bank contact clients more strategically by calling only those in the target group. Additionally, these probabilities can be used to segment customers into clusters that serve other marketing goals, such as reaching out to customers who are at risk of leaving or switching financial institutions.
There are twenty attributes in the data set that provide information about the customers' demographic, social, and economic status. We will explore how these features affect the likelihood of subscribing to a long-term deposit. In the remainder of the case study, we discuss additional literature on existing methodologies in this area, followed by the techniques we used in this case study and a brief description of the data set. Finally, we present the findings from our final model and discuss our recommendations.
Methodology
In our analysis, we fit a logistic regression model to the data and performed stepwise model selection under both the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) to find the best subset of predictors. AIC and BIC are two estimates of how well a model fits the data; both balance goodness of fit against model complexity, with BIC penalizing complexity more heavily. A lower value of either criterion indicates a better fit.
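For a model with \(k\) estimated parameters, maximized likelihood \(\hat{L}\), and sample size \(n\), the two criteria are \(\mathrm{AIC} = 2k - 2\ln\hat{L}\) and \(\mathrm{BIC} = k\ln(n) - 2\ln\hat{L}\). Because \(\ln(n) > 2\) once \(n\) exceeds about seven observations, BIC charges more for each additional parameter than AIC, which is why it tends to select smaller models.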
Prior to fitting the model, the data was split into two subsets, one for training the model and one for testing its performance. The bank marketing data set was randomly split using an 80/20 ratio into training data (bank_train) and testing data (bank_test). The training data consists of 3,296 observations and the testing data consists of 823 observations.
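The partition was created with caret's createDataPartition after fixing the random seed; the same code appears in full in the Appendix:

# Set seed for reproducibility
set.seed(1)
# Indices for an 80/20 split of the data
inTrain <- createDataPartition(bank$y, p = 0.8, list = FALSE)
bank_train <- bank[inTrain, ]
bank_test <- bank[-inTrain, ]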
The full logistic regression model was then fit using the training data set. Next, stepwise model selection was performed, first under the AIC criterion and then under the BIC criterion. This was done in both directions using the Chi-squared test (direction = "both", test = "Chisq"). The final model selected under AIC includes seven variables: nr.employed (number of employees), poutcome (outcome of the previous marketing campaign), month (last contact month of the year), campaign (number of contacts performed during this campaign for this client), contact (contact communication type), cons.conf.idx (consumer confidence index), and age. The final model selected under BIC contains only two variables: nr.employed and pdays (the number of days that passed after the client was last contacted from a previous campaign).
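Both searches start from the null model with the full model as the upper scope; the BIC search differs only in step()'s penalty argument k, which changes from its default of 2 to log(n) (mirroring the Appendix):

# Stepwise selection under AIC (default penalty k = 2)
glm.aic <- step(log.model.null, scope = list(upper = log.model.full),
                direction = "both", test = "Chisq", trace = FALSE)
# Stepwise selection under BIC (penalty k = log(n))
glm.bic <- step(log.model.null, scope = list(upper = log.model.full),
                direction = "both", test = "Chisq", trace = FALSE,
                k = log(nrow(bank_train)))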
In order to classify the predictions of the AIC and BIC models, we first converted the predicted probabilities into binary values, 0 or 1. To find the optimal threshold for each model, we plotted sensitivity and specificity against the cutoff and chose the point where the two curves intersect. For the AIC model the intersection occurs at about 0.076; for the BIC model it occurs at about 0.146 (the code that produces these plots appears in the Appendix). These values were used as the cutoff for the predicted probabilities: if a probability is greater than or equal to the model's threshold, the prediction is classified as 1 (the customer will subscribe); otherwise it is classified as 0 (the customer will not subscribe).
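Applying a cutoff to the test-set probabilities then reduces to a single comparison, shown here for the AIC model (the full pipeline is in the Appendix):

# Predicted probabilities on the test set
prob.aic <- predict(glm.aic, newdata = bank_test, type = "response")
# Classify as 1 (will subscribe) when the probability meets the cutoff
pred.aic <- ifelse(prob.aic >= optimal.aic, 1, 0)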
Logistic regression relies on several major assumptions. First, there is a linear relationship between the independent variables and the log odds of y. Second, the outcome variable is categorical, either binary or ordinal. Third, there are no high correlations (multicollinearity) among the predictors. Finally, the observations are independent of each other. Failure to meet these assumptions may lead to a poor model fit and inaccurate results.
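As an illustration of checking the multicollinearity assumption (a step we did not perform in the original analysis), generalized variance inflation factors from the car package could be computed on a fitted model:

# Assumes the car package is installed; this check was not part of our analysis
library(car)
# GVIF values far above 5-10 would signal problematic collinearity
vif(glm.aic)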
Data
The bank marketing data set can be found and imported from the UCI Machine Learning Repository. It comprises 4,119 instances with twenty input variables (ten numeric and ten categorical) and one response variable (y). The classification goal is to predict whether the client will subscribe to a long-term bank deposit (yes or no) based on phone calls. The data was relatively clean to start with, and there were no missing values. Since we use a logistic regression model for this case study, we converted all categorical variables into factors at the beginning of our analysis.
The input variables for the logit model fall into three categories: the customer's demographic data (age, job, marital, education, default, housing, loan), data related to the current campaign (contact, month, day_of_week, campaign, pdays, previous, poutcome), and data related to the social and economic context (emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed).
The response variable, y, was converted into a binary outcome, 1 for yes and 0 for no. The variable duration is not known when a call is placed; its value is known only once the call ends, at which point y is also known. Because duration is intended only for benchmarking purposes and is highly correlated with the outcome variable y, we removed it from the data set.
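A minimal sketch of that step in base R:

# Drop duration: its value is known only after the call ends, so it leaks the outcome
bank$duration <- NULL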
After checking the structure of the data, we looked at the distributions of the variables; one histogram stood out, that of pdays.
According to its description, pdays is the number of days that passed after the customer was last contacted from a previous campaign. It is a numeric variable, with the value 999 indicating that the customer was not previously contacted. Its histogram shows that the vast majority of samples are clients who were not previously contacted. We did try removing pdays from the data set before fitting the models; however, this did not change the accuracy or sensitivity of either the AIC or the BIC model, so we decided to keep the variable.
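The histogram can be reproduced with ggplot2, which the Appendix already loads (a minimal sketch; the bin count is an arbitrary choice):

# Distribution of pdays; the spike at 999 marks clients never contacted before
ggplot(bank, aes(x = pdays)) +
  geom_histogram(bins = 30) +
  labs(title = "Histogram of pdays", x = "pdays", y = "count")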
Findings
After fitting our models on the training data, we compared the AIC and BIC models' performance on the test data. The confusion matrices below show that the AIC model has the higher overall accuracy at 75.6%, with 60.8% sensitivity and 77.2% specificity. The BIC model has 73.4% accuracy, with 62% sensitivity and 74.6% specificity.
# Accuracy of AIC model
confusionMatrix(as.factor(pred.aic), as.factor(bank_test$y), positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 574 31
## 1 170 48
##
## Accuracy : 0.7558
## 95% CI : (0.7249, 0.7848)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2122
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.60759
## Specificity : 0.77151
## Pos Pred Value : 0.22018
## Neg Pred Value : 0.94876
## Prevalence : 0.09599
## Detection Rate : 0.05832
## Detection Prevalence : 0.26488
## Balanced Accuracy : 0.68955
##
## 'Positive' Class : 1
##
# Accuracy of BIC model
confusionMatrix(as.factor(pred.bic), as.factor(bank_test$y), positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 555 30
## 1 189 49
##
## Accuracy : 0.7339
## 95% CI : (0.7023, 0.7638)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1928
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.62025
## Specificity : 0.74597
## Pos Pred Value : 0.20588
## Neg Pred Value : 0.94872
## Prevalence : 0.09599
## Detection Rate : 0.05954
## Detection Prevalence : 0.28919
## Balanced Accuracy : 0.68311
##
## 'Positive' Class : 1
##
Our final choice for this case study is the model selected under the Bayesian Information Criterion, as it has the higher sensitivity and contains fewer variables, making it less complex. Below are the results of the BIC model.
glm.bic <- step(log.model.null, scope = list(upper = log.model.full),
direction="both", test="Chisq", trace = F, k=log(nrow(bank_train)))
summary(glm.bic)
##
## Call:
## glm(formula = y ~ nr.employed + pdays, family = "binomial", data = bank_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7171 -0.5619 -0.3366 -0.2824 2.5463
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 56.4726824 3.9295590 14.371 < 2e-16 ***
## nr.employed -0.0111319 0.0007806 -14.261 < 2e-16 ***
## pdays -0.0014777 0.0002094 -7.058 1.69e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2323.4 on 3295 degrees of freedom
## Residual deviance: 1899.8 on 3293 degrees of freedom
## AIC: 1905.8
##
## Number of Fisher Scoring iterations: 5
Based on the table above, our final model equation is: \(\log\frac{p}{1-p} = 56.4726824 - 0.0111319\,\textit{nr.employed} - 0.0014777\,\textit{pdays}\)
Both predictors are negatively associated with our response variable. Holding all other variables constant, the odds that a customer subscribes to a long-term deposit change by a factor of \(e^{-0.0111319} = 0.989\) for each one-unit increase in nr.employed; that is, a customer is more likely to subscribe when the number of employees is lower. Similarly, the odds change by a factor of \(e^{-0.0014777} = 0.999\) for each one-unit increase in pdays, so the likelihood of subscribing is higher when fewer days have passed since the client was last contacted from a previous campaign.
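The same odds ratios can be obtained directly by exponentiating the fitted coefficients:

# Odds ratios for a one-unit increase in each predictor
exp(coef(glm.bic))
# exp(-0.0111319) is approximately 0.989 (nr.employed)
# exp(-0.0014777) is approximately 0.999 (pdays)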
Conclusions
To achieve our goal in this case study of identifying and targeting customers who are likely to subscribe to a long-term bank deposit using the predicted probabilities from the logistic regression model, the model should focus on correctly detecting positive cases. The proportion of actual positives that are correctly identified is of primary importance. Therefore, the model should favor a higher sensitivity (true positive rate) over specificity (true negative rate).
Summarizing the results of the logistic models selected under the Akaike and Bayesian Information Criteria, we recommend the BIC model. Both models have similar accuracy, but the BIC model has higher sensitivity, meaning it misses fewer actual subscribers (fewer false negatives). The BIC model also contains fewer variables, making it less complex and our final choice for this classification task.
Appendix
library(caret)
library(ROCR)
library(ggplot2)
theme_set(theme_light())
# Load the data
bank <- read.csv("bank-additional.csv", sep = ";", stringsAsFactors = TRUE)
attach(bank)
# Change response variable to 1 for yes and 0 for no
bank$y <- ifelse(bank$y=="yes",1,0)
str(bank)
## 'data.frame': 4119 obs. of 21 variables:
## $ age : int 30 39 25 38 47 32 32 41 31 35 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 2 3 3 2 1 2 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 3 4 4 3 7 7 7 7 6 3 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 2 1 2 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 3 1 3 2 3 1 3 3 1 1 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 1 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 1 2 2 2 1 1 1 1 1 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 5 8 10 10 8 8 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
## $ duration : int 487 346 227 17 58 128 290 44 68 170 ...
## $ campaign : int 2 4 1 3 1 3 4 2 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 2 0 0 1 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ emp.var.rate : num -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
## $ cons.price.idx: num 92.9 94 94.5 94.5 93.2 ...
## $ cons.conf.idx : num -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
## $ euribor3m : num 1.31 4.86 4.96 4.96 4.19 ...
## $ nr.employed : num 5099 5191 5228 5228 5196 ...
## $ y : num 0 0 0 0 0 0 0 0 0 0 ...
# Summary statistics for all variables
summary(bank)
## age job marital education
## Min. :18.00 admin. :1012 divorced: 446 university.degree :1264
## 1st Qu.:32.00 blue-collar: 884 married :2509 high.school : 921
## Median :38.00 technician : 691 single :1153 basic.9y : 574
## Mean :40.11 services : 393 unknown : 11 professional.course: 535
## 3rd Qu.:47.00 management : 324 basic.4y : 429
## Max. :88.00 retired : 166 basic.6y : 228
## (Other) : 649 (Other) : 168
## default housing loan contact month
## no :3315 no :1839 no :3349 cellular :2652 may :1378
## unknown: 803 unknown: 105 unknown: 105 telephone:1467 jul : 711
## yes : 1 yes :2175 yes : 665 aug : 636
## jun : 530
## nov : 446
## apr : 215
## (Other): 203
## day_of_week duration campaign pdays previous
## fri:768 Min. : 0.0 Min. : 1.000 Min. : 0.0 Min. :0.0000
## mon:855 1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.0000
## thu:860 Median : 181.0 Median : 2.000 Median :999.0 Median :0.0000
## tue:841 Mean : 256.8 Mean : 2.537 Mean :960.4 Mean :0.1903
## wed:795 3rd Qu.: 317.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.0000
## Max. :3643.0 Max. :35.000 Max. :999.0 Max. :6.0000
##
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## failure : 454 Min. :-3.40000 Min. :92.20 Min. :-50.8
## nonexistent:3523 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## success : 142 Median : 1.10000 Median :93.75 Median :-41.8
## Mean : 0.08497 Mean :93.58 Mean :-40.5
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. : 1.40000 Max. :94.77 Max. :-26.9
##
## euribor3m nr.employed y
## Min. :0.635 Min. :4964 Min. :0.0000
## 1st Qu.:1.334 1st Qu.:5099 1st Qu.:0.0000
## Median :4.857 Median :5191 Median :0.0000
## Mean :3.621 Mean :5166 Mean :0.1095
## 3rd Qu.:4.961 3rd Qu.:5228 3rd Qu.:0.0000
## Max. :5.045 Max. :5228 Max. :1.0000
##
# Total number of missing values
sum(is.na(bank))
## [1] 0
##
# Distribution of the binary response
table(bank$y)
## 0 1
## 3668 451
# Set seed for reproducibility
set.seed(1)
# Split the data into Training and Test sets
inTrain <- createDataPartition(bank$y, p = 0.8, list = FALSE)
bank_train <- bank[inTrain, ]
bank_test <- bank[-inTrain, ]
# Null model
log.model.null <- glm(y ~ 1, data = bank_train, family = "binomial")
# Full logistic regression model
log.model.full <- glm(y ~ ., data = bank_train, family = "binomial")
summary(log.model.full)
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = bank_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9448 -0.3964 -0.3156 -0.2456 2.8720
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.515e+02 1.136e+02 -1.334 0.182208
## age 1.788e-02 7.612e-03 2.349 0.018804 *
## jobblue-collar -2.756e-01 2.499e-01 -1.103 0.270111
## jobentrepreneur -6.908e-01 4.583e-01 -1.507 0.131717
## jobhousemaid -2.236e-01 4.565e-01 -0.490 0.624289
## jobmanagement -5.767e-01 2.775e-01 -2.078 0.037685 *
## jobretired -1.944e-01 3.341e-01 -0.582 0.560620
## jobself-employed -7.595e-01 4.001e-01 -1.898 0.057647 .
## jobservices -2.237e-01 2.754e-01 -0.812 0.416589
## jobstudent 1.203e-01 3.826e-01 0.314 0.753278
## jobtechnician -5.621e-02 2.131e-01 -0.264 0.792001
## jobunemployed -7.829e-02 3.703e-01 -0.211 0.832583
## jobunknown -1.625e-01 6.486e-01 -0.250 0.802233
## maritalmarried 3.365e-02 2.182e-01 0.154 0.877422
## maritalsingle 1.147e-01 2.536e-01 0.452 0.651045
## maritalunknown -4.919e-03 1.197e+00 -0.004 0.996720
## educationbasic.6y 1.672e-01 3.737e-01 0.447 0.654654
## educationbasic.9y 2.308e-01 2.920e-01 0.790 0.429430
## educationhigh.school -2.213e-02 2.868e-01 -0.077 0.938486
## educationilliterate -1.235e+01 5.354e+02 -0.023 0.981600
## educationprofessional.course 5.175e-02 3.129e-01 0.165 0.868633
## educationuniversity.degree 2.082e-01 2.845e-01 0.732 0.464302
## educationunknown 1.200e-01 3.800e-01 0.316 0.752268
## defaultunknown 2.480e-02 1.971e-01 0.126 0.899877
## defaultyes -1.013e+01 5.354e+02 -0.019 0.984898
## housingunknown -2.577e-01 4.405e-01 -0.585 0.558572
## housingyes -9.526e-02 1.309e-01 -0.728 0.466633
## loanunknown NA NA NA NA
## loanyes -1.183e-01 1.790e-01 -0.661 0.508744
## contacttelephone -8.928e-01 2.618e-01 -3.411 0.000648 ***
## monthaug -7.220e-02 3.927e-01 -0.184 0.854106
## monthdec 6.675e-01 6.107e-01 1.093 0.274432
## monthjul -1.160e-01 3.367e-01 -0.344 0.730557
## monthjun 1.707e-01 4.029e-01 0.424 0.671819
## monthmar 1.740e+00 4.972e-01 3.499 0.000467 ***
## monthmay -4.285e-01 2.790e-01 -1.536 0.124614
## monthnov -5.046e-01 3.949e-01 -1.278 0.201300
## monthoct -2.757e-01 4.978e-01 -0.554 0.579661
## monthsep -2.212e-01 5.639e-01 -0.392 0.694841
## day_of_weekmon 1.131e-01 2.062e-01 0.548 0.583501
## day_of_weekthu 1.544e-01 2.096e-01 0.737 0.461389
## day_of_weektue 1.380e-01 2.093e-01 0.659 0.509837
## day_of_weekwed 3.017e-01 2.143e-01 1.408 0.159169
## campaign -6.649e-02 3.825e-02 -1.738 0.082159 .
## pdays -3.853e-04 6.419e-04 -0.600 0.548294
## previous 1.532e-01 1.705e-01 0.898 0.368932
## poutcomenonexistent 4.941e-01 2.867e-01 1.723 0.084827 .
## poutcomesuccess 1.175e+00 6.380e-01 1.841 0.065561 .
## emp.var.rate -8.374e-01 4.303e-01 -1.946 0.051655 .
## cons.price.idx 1.461e+00 7.507e-01 1.946 0.051715 .
## cons.conf.idx 5.473e-02 2.572e-02 2.128 0.033364 *
## euribor3m -6.369e-02 4.016e-01 -0.159 0.874014
## nr.employed 2.818e-03 9.212e-03 0.306 0.759658
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2323.4 on 3295 degrees of freedom
## Residual deviance: 1779.7 on 3244 degrees of freedom
## AIC: 1883.7
##
## Number of Fisher Scoring iterations: 12
# Stepwise model selection with AIC criteria
glm.aic <- step(log.model.null, scope = list(upper = log.model.full),
direction = "both", test = "Chisq", trace = F)
summary(glm.aic)
##
## Call:
## glm(formula = y ~ nr.employed + month + poutcome + contact +
## cons.conf.idx + age + campaign, family = "binomial", data = bank_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0084 -0.3925 -0.3247 -0.2505 2.6965
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 54.971712 5.086656 10.807 < 2e-16 ***
## nr.employed -0.010909 0.001008 -10.817 < 2e-16 ***
## monthaug -0.197300 0.320996 -0.615 0.53879
## monthdec 0.405214 0.551729 0.734 0.46268
## monthjul 0.170190 0.293655 0.580 0.56221
## monthjun 0.834335 0.285462 2.923 0.00347 **
## monthmar 1.368456 0.417144 3.281 0.00104 **
## monthmay -0.550316 0.250787 -2.194 0.02821 *
## monthnov -0.440300 0.308987 -1.425 0.15416
## monthoct -0.355389 0.405363 -0.877 0.38064
## monthsep -0.584459 0.413824 -1.412 0.15785
## poutcomenonexistent 0.240466 0.185225 1.298 0.19421
## poutcomesuccess 1.583589 0.260890 6.070 1.28e-09 ***
## contacttelephone -0.596318 0.193693 -3.079 0.00208 **
## cons.conf.idx 0.034335 0.014910 2.303 0.02129 *
## age 0.012431 0.005273 2.358 0.01839 *
## campaign -0.070731 0.038236 -1.850 0.06433 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2323.4 on 3295 degrees of freedom
## Residual deviance: 1802.8 on 3279 degrees of freedom
## AIC: 1836.8
##
## Number of Fisher Scoring iterations: 6
# Stepwise model selection with BIC criteria
glm.bic <- step(log.model.null, scope = list(upper = log.model.full),
direction="both", test="Chisq", trace = F, k=log(nrow(bank_train)))
summary(glm.bic)
##
## Call:
## glm(formula = y ~ nr.employed + pdays, family = "binomial", data = bank_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7171 -0.5619 -0.3366 -0.2824 2.5463
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 56.4726824 3.9295590 14.371 < 2e-16 ***
## nr.employed -0.0111319 0.0007806 -14.261 < 2e-16 ***
## pdays -0.0014777 0.0002094 -7.058 1.69e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2323.4 on 3295 degrees of freedom
## Residual deviance: 1899.8 on 3293 degrees of freedom
## AIC: 1905.8
##
## Number of Fisher Scoring iterations: 5
pred.aic <- prediction(predict(glm.aic, bank_train, type = "response"), bank_train$y)
# Compute the cutoff that best trades off sensitivity and specificity
plot(unlist(performance(pred.aic, "sens")@x.values),
     unlist(performance(pred.aic, "sens")@y.values),
     type = "l", lwd = 2, ylab = "", xlab = "Cutoff")
mtext("Sensitivity", side = 2)
mtext("Sensitivity vs. Specificity Plot for AIC Model", side = 3)
# Overlay specificity in the same plot
par(new = TRUE)
plot(unlist(performance(pred.aic, "spec")@x.values),
     unlist(performance(pred.aic, "spec")@y.values),
     type = "l", lwd = 2, col = "red", ylab = "", xlab = "Cutoff")
axis(4, at = seq(0, 1, 0.2))
mtext("Specificity", side = 4, col = "red")
# Find the cutoff where the sensitivity and specificity curves intersect
min.diff.aic <- which.min(abs(unlist(performance(pred.aic, "sens")@y.values) -
                              unlist(performance(pred.aic, "spec")@y.values)))
min.x.aic <- unlist(performance(pred.aic, "sens")@x.values)[min.diff.aic]
min.y.aic <- unlist(performance(pred.aic, "spec")@y.values)[min.diff.aic]
optimal.aic <- min.x.aic
optimal.aic
## 549
## 0.07573385
pred.bic <- prediction(predict(glm.bic, bank_train, type = "response"), bank_train$y)
# Compute the cutoff that best trades off sensitivity and specificity
plot(unlist(performance(pred.bic, "sens")@x.values),
     unlist(performance(pred.bic, "sens")@y.values),
     type = "l", lwd = 2, ylab = "", xlab = "Cutoff")
mtext("Sensitivity", side = 2)
mtext("Sensitivity vs. Specificity Plot for BIC Model", side = 3)
# Overlay specificity in the same plot
par(new = TRUE)
plot(unlist(performance(pred.bic, "spec")@x.values),
     unlist(performance(pred.bic, "spec")@y.values),
     type = "l", lwd = 2, col = "red", ylab = "", xlab = "Cutoff")
axis(4, at = seq(0, 1, 0.2))
mtext("Specificity", side = 4, col = "red")
# Find the cutoff where the sensitivity and specificity curves intersect
min.diff.bic <- which.min(abs(unlist(performance(pred.bic, "sens")@y.values) -
                              unlist(performance(pred.bic, "spec")@y.values)))
min.x.bic <- unlist(performance(pred.bic, "sens")@x.values)[min.diff.bic]
min.y.bic <- unlist(performance(pred.bic, "spec")@y.values)[min.diff.bic]
optimal.bic <- min.x.bic
optimal.bic
## 4117
## 0.1460365
# Predicted probabilities for AIC model
prob.aic <- predict.glm(glm.aic, newdata = bank_test, type="response")
# Convert AIC probabilities to binary
pred.aic <- ifelse(prob.aic >= optimal.aic, 1, 0)
# Check the accuracy of AIC model
confusionMatrix(as.factor(pred.aic), as.factor(bank_test$y), positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 574 31
## 1 170 48
##
## Accuracy : 0.7558
## 95% CI : (0.7249, 0.7848)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2122
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.60759
## Specificity : 0.77151
## Pos Pred Value : 0.22018
## Neg Pred Value : 0.94876
## Prevalence : 0.09599
## Detection Rate : 0.05832
## Detection Prevalence : 0.26488
## Balanced Accuracy : 0.68955
##
## 'Positive' Class : 1
##
# Predicted probabilities for BIC model
prob.bic <- predict.glm(glm.bic, newdata = bank_test, type="response")
# Convert BIC probabilities to binary
pred.bic <- ifelse(prob.bic >= optimal.bic, 1, 0)
# Check the accuracy of BIC model
confusionMatrix(as.factor(pred.bic), as.factor(bank_test$y), positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 555 30
## 1 189 49
##
## Accuracy : 0.7339
## 95% CI : (0.7023, 0.7638)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1928
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.62025
## Specificity : 0.74597
## Pos Pred Value : 0.20588
## Neg Pred Value : 0.94872
## Prevalence : 0.09599
## Detection Rate : 0.05954
## Detection Prevalence : 0.28919
## Balanced Accuracy : 0.68311
##
## 'Positive' Class : 1
##