Executive Summary
A logistic regression model was used to predict whether a customer would subscribe to a long-term bank deposit, recorded as either a success or a failure. Since this is a dichotomous classification problem, logistic regression was our first choice for this case study, as the model also provides information about the statistical significance of the predictors and their relationship to the outcome variable. In our analysis, we fit two logit models, one chosen by stepwise selection under the Akaike Information Criterion (AIC) and one under the Bayesian Information Criterion (BIC), and compared their accuracy and sensitivity to find the best model. The AIC model achieved 75.6% accuracy and 60.8% sensitivity on the testing data. Although the BIC model had a slightly lower accuracy of 73.4%, it had a higher sensitivity of 62% and contained fewer variables, making it our final choice for this classification task.
The Problem
Bank marketing campaigns are a popular way to engage with customers, advertise a new product or service, and grow the business. This case study uses direct marketing, a promotional method in which a targeted group of customers is contacted directly in order to increase sales of a product of interest. Some examples of direct marketing include text messaging, telemarketing, and email marketing. An important feature of this strategy is that the effectiveness of a campaign can be measured by monitoring customers' feedback and responses.
Using direct marketing campaign data from a Portuguese financial institution, our goal is to predict the likelihood of a customer subscribing to a long-term deposit. In this campaign, customers were contacted through phone calls, both inbound and outbound. Often, more than one contact with the same client was required to determine whether the client would subscribe to the deposit.
The value of this study lies in identifying and targeting customers who are likely to subscribe to long-term deposits using the predicted probabilities from our logistic regression model. This can help the bank contact clients more strategically by calling only those in the target group. Additionally, these probabilities can be used to segment customers into clusters that serve other marketing goals, such as reaching out to customers who are at risk of leaving or switching financial institutions.
There are twenty attributes in the data set that provide information about the customers' demographic, social, and economic status. We will explore how these features affect the likelihood of subscribing to a long-term deposit. In the remainder of the case study, we discuss additional literature on existing methodologies in this area, followed by the techniques we used in this case study and a brief description of the data set. Finally, we present the findings from our final model and discuss our recommendations.
Methodology
In our analysis, we fit a logistic regression model to the data and performed stepwise model selection under both the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) to find the best subset of predictors. AIC and BIC are two estimates of how well a model fits the data; both balance goodness of fit against model complexity, with BIC penalizing complexity more heavily. A lower value of either criterion indicates a better fit.
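For a model with \(k\) estimated parameters, maximized likelihood \(\hat{L}\), and sample size \(n\), the two criteria are \(\mathrm{AIC} = 2k - 2\ln\hat{L}\) and \(\mathrm{BIC} = k\ln(n) - 2\ln\hat{L}\). Because \(\ln(n) > 2\) once \(n\) exceeds about seven observations, BIC charges more for each additional parameter than AIC, which is why it tends to select smaller models.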
Prior to fitting the model, the data was split into two subsets, one for training the model and one for testing its performance. The bank marketing data set was randomly split using an 80/20 ratio into training data (bank_train) and testing data (bank_test). The training data consists of 3,296 observations and the testing data consists of 823 observations.
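The partition was created with caret's createDataPartition after fixing the random seed; the same code appears in full in the Appendix:

# Set seed for reproducibility
set.seed(1)
# Indices for an 80/20 split of the data
inTrain <- createDataPartition(bank$y, p = 0.8, list = FALSE)
bank_train <- bank[inTrain, ]
bank_test <- bank[-inTrain, ]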
The full logistic regression model was then fit using the training data set. Next, stepwise model selection was performed, first under the AIC criterion and then under the BIC criterion. This was done in both directions using the Chi-squared test (direction = "both", test = "Chisq"). The final model selected under AIC includes seven variables: nr.employed (number of employees), poutcome (outcome of the previous marketing campaign), month (last contact month of the year), campaign (number of contacts performed during this campaign for this client), contact (contact communication type), cons.conf.idx (consumer confidence index), and age. The final model selected under BIC contains only two variables: nr.employed and pdays (the number of days that passed after the client was last contacted from a previous campaign).
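Both searches start from the null model with the full model as the upper scope; the BIC search differs only in step()'s penalty argument k, which changes from its default of 2 to log(n) (mirroring the Appendix):

# Stepwise selection under AIC (default penalty k = 2)
glm.aic <- step(log.model.null, scope = list(upper = log.model.full),
                direction = "both", test = "Chisq", trace = FALSE)
# Stepwise selection under BIC (penalty k = log(n))
glm.bic <- step(log.model.null, scope = list(upper = log.model.full),
                direction = "both", test = "Chisq", trace = FALSE,
                k = log(nrow(bank_train)))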
In order to classify the predictions of the AIC and BIC models, we first converted the predicted probabilities into binary values, 0 or 1. To find the optimal threshold for each model, we plotted sensitivity and specificity against the cutoff and chose the point where the two curves intersect. For the AIC model the intersection occurs at about 0.076; for the BIC model it occurs at about 0.146 (the code that produces these plots appears in the Appendix). These values were used as the cutoff for the predicted probabilities: if a probability is greater than or equal to the model's threshold, the prediction is classified as 1 (the customer will subscribe); otherwise it is classified as 0 (the customer will not subscribe).
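Applying a cutoff to the test-set probabilities then reduces to a single comparison, shown here for the AIC model (the full pipeline is in the Appendix):

# Predicted probabilities on the test set
prob.aic <- predict(glm.aic, newdata = bank_test, type = "response")
# Classify as 1 (will subscribe) when the probability meets the cutoff
pred.aic <- ifelse(prob.aic >= optimal.aic, 1, 0)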
Logistic regression relies on several major assumptions. First, there is a linear relationship between the independent variables and the log odds of y. Second, the outcome variable is categorical, either binary or ordinal. Third, there are no high correlations (multicollinearity) among the predictors. Finally, the observations are independent of each other. Failure to meet these assumptions may lead to a poor model fit and inaccurate results.
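As an illustration of checking the multicollinearity assumption (a step we did not perform in the original analysis), generalized variance inflation factors from the car package could be computed on a fitted model:

# Assumes the car package is installed; this check was not part of our analysis
library(car)
# GVIF values far above 5-10 would signal problematic collinearity
vif(glm.aic)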
Data
The bank marketing data set can be found and imported from the UCI Machine Learning Repository. It comprises 4,119 instances with twenty input variables (ten numeric and ten categorical) and one response variable (y). The classification goal is to predict whether the client will subscribe to a long-term bank deposit (yes or no) based on phone calls. The data was relatively clean to start with, and there were no missing values. Since we use a logistic regression model for this case study, we converted all categorical variables into factors at the beginning of our analysis.
The input variables for the logit model fall into three categories: the customer's demographic data (age, job, marital, education, default, housing, loan), data related to the current campaign (contact, month, day_of_week, campaign, pdays, previous, poutcome), and data related to the social and economic context (emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed).
The response variable, y, was converted into a binary outcome, 1 for yes and 0 for no. The variable duration is not known when a call is placed; its value is known only once the call ends, at which point y is also known. Because duration is intended only for benchmarking purposes and is highly correlated with the outcome variable y, we removed it from the data set.
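A minimal sketch of that step in base R:

# Drop duration: its value is known only after the call ends, so it leaks the outcome
bank$duration <- NULL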
After checking the structure of the data, we looked at the distributions of the variables; one histogram stood out, that of pdays.
According to its description, pdays is the number of days that passed after the customer was last contacted from a previous campaign. It is a numeric variable, with the value 999 indicating that the customer was not previously contacted. Its histogram shows that the vast majority of samples are clients who were not previously contacted. We did try removing pdays from the data set before fitting the models; however, this did not change the accuracy or sensitivity of either the AIC or the BIC model, so we decided to keep the variable.
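The histogram can be reproduced with ggplot2, which the Appendix already loads (a minimal sketch; the bin count is an arbitrary choice):

# Distribution of pdays; the spike at 999 marks clients never contacted before
ggplot(bank, aes(x = pdays)) +
  geom_histogram(bins = 30) +
  labs(title = "Histogram of pdays", x = "pdays", y = "count")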
Findings
After fitting our models on the training data, we compared the AIC and BIC models' performance on the test data. The confusion matrices below show that the AIC model has the higher overall accuracy at 75.6%, with 60.8% sensitivity and 77.2% specificity. The BIC model has 73.4% accuracy, with 62% sensitivity and 74.6% specificity.
# Accuracy of AIC model
confusionMatrix(as.factor(pred.aic), as.factor(bank_test$y), positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 574 31
## 1 170 48
##
## Accuracy : 0.7558
## 95% CI : (0.7249, 0.7848)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2122
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.60759
## Specificity : 0.77151
## Pos Pred Value : 0.22018
## Neg Pred Value : 0.94876
## Prevalence : 0.09599
## Detection Rate : 0.05832
## Detection Prevalence : 0.26488
## Balanced Accuracy : 0.68955
##
## 'Positive' Class : 1
##
# Accuracy of BIC model
confusionMatrix(as.factor(pred.bic), as.factor(bank_test$y), positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 555 30
## 1 189 49
##
## Accuracy : 0.7339
## 95% CI : (0.7023, 0.7638)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1928
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.62025
## Specificity : 0.74597
## Pos Pred Value : 0.20588
## Neg Pred Value : 0.94872
## Prevalence : 0.09599
## Detection Rate : 0.05954
## Detection Prevalence : 0.28919
## Balanced Accuracy : 0.68311
##
## 'Positive' Class : 1
##
Our final choice for this case study is the model selected under the Bayesian Information Criterion, as it has the higher sensitivity and contains fewer variables, making it less complex. Below are the results of the BIC model.
glm.bic <- step(log.model.null, scope = list(upper = log.model.full),
direction="both", test="Chisq", trace = F, k=log(nrow(bank_train)))
summary(glm.bic)
##
## Call:
## glm(formula = y ~ nr.employed + pdays, family = "binomial", data = bank_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7171 -0.5619 -0.3366 -0.2824 2.5463
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 56.4726824 3.9295590 14.371 < 2e-16 ***
## nr.employed -0.0111319 0.0007806 -14.261 < 2e-16 ***
## pdays -0.0014777 0.0002094 -7.058 1.69e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2323.4 on 3295 degrees of freedom
## Residual deviance: 1899.8 on 3293 degrees of freedom
## AIC: 1905.8
##
## Number of Fisher Scoring iterations: 5
Based on the table above, our final model equation is: \(\log\frac{p}{1-p} = 56.4726824 - 0.0111319\,\textit{nr.employed} - 0.0014777\,\textit{pdays}\)
Both predictors are negatively associated with our response variable. Holding all other variables constant, the odds that a customer subscribes to a long-term deposit change by a factor of \(e^{-0.0111319} = 0.989\) for each one-unit increase in nr.employed; that is, a customer is more likely to subscribe when the number of employees is lower. Similarly, the odds change by a factor of \(e^{-0.0014777} = 0.999\) for each one-unit increase in pdays, so the likelihood of subscribing is higher when fewer days have passed since the client was last contacted from a previous campaign.
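The same odds ratios can be obtained directly by exponentiating the fitted coefficients:

# Odds ratios for a one-unit increase in each predictor
exp(coef(glm.bic))
# exp(-0.0111319) is approximately 0.989 (nr.employed)
# exp(-0.0014777) is approximately 0.999 (pdays)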
Conclusions
To achieve our goal in this case study of identifying and targeting customers who are likely to subscribe to a long-term bank deposit using the predicted probabilities from the logistic regression model, the model should focus on correctly detecting positive cases. The proportion of actual positives that are correctly identified is of primary importance. Therefore, the model should favor a higher sensitivity (true positive rate) over specificity (true negative rate).
Summarizing the results of the logistic models selected under the Akaike and Bayesian Information Criteria, we recommend the BIC model. Both models have similar accuracy, but the BIC model has higher sensitivity, meaning it misses fewer actual subscribers (fewer false negatives). The BIC model also contains fewer variables, making it less complex and our final choice for this classification task.
Appendix
library(caret)
library(ROCR)
library(ggplot2)
theme_set(theme_light())
# Load the data
bank <- read.csv("bank-additional.csv", sep = ";", stringsAsFactors = TRUE)
attach(bank)
# Change response variable to 1 for yes and 0 for no
bank$y <- ifelse(bank$y=="yes",1,0)
str(bank)
## 'data.frame': 4119 obs. of 21 variables:
## $ age : int 30 39 25 38 47 32 32 41 31 35 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 2 3 3 2 1 2 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 3 4 4 3 7 7 7 7 6 3 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 2 1 2 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 3 1 3 2 3 1 3 3 1 1 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 1 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 1 2 2 2 1 1 1 1 1 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 5 8 10 10 8 8 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
## $ duration : int 487 346 227 17 58 128 290 44 68 170 ...
## $ campaign : int 2 4 1 3 1 3 4 2 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 2 0 0 1 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ emp.var.rate : num -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
## $ cons.price.idx: num 92.9 94 94.5 94.5 93.2 ...
## $ cons.conf.idx : num -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
## $ euribor3m : num 1.31 4.86 4.96 4.96 4.19 ...
## $ nr.employed : num 5099 5191 5228 5228 5196 ...
## $ y : num 0 0 0 0 0 0 0 0 0 0 ...
# Summary statistics for all variables
summary(bank)
## age job marital education
## Min. :18.00 admin. :1012 divorced: 446 university.degree :1264
## 1st Qu.:32.00 blue-collar: 884 married :2509 high.school : 921
## Median :38.00 technician : 691 single :1153 basic.9y : 574
## Mean :40.11 services : 393 unknown : 11 professional.course: 535
## 3rd Qu.:47.00 management : 324 basic.4y : 429
## Max. :88.00 retired : 166 basic.6y : 228
## (Other) : 649 (Other) : 168
## default housing loan contact month
## no :3315 no :1839 no :3349 cellular :2652 may :1378
## unknown: 803 unknown: 105 unknown: 105 telephone:1467 jul : 711
## yes : 1 yes :2175 yes : 665 aug : 636
## jun : 530
## nov : 446
## apr : 215
## (Other): 203
## day_of_week duration campaign pdays previous
## fri:768 Min. : 0.0 Min. : 1.000 Min. : 0.0 Min. :0.0000
## mon:855 1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.0000
## thu:860 Median : 181.0 Median : 2.000 Median :999.0 Median :0.0000
## tue:841 Mean : 256.8 Mean : 2.537 Mean :960.4 Mean :0.1903
## wed:795 3rd Qu.: 317.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.0000
## Max. :3643.0 Max. :35.000 Max. :999.0 Max. :6.0000
##
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## failure : 454 Min. :-3.40000 Min. :92.20 Min. :-50.8
## nonexistent:3523 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## success : 142 Median : 1.10000 Median :93.75 Median :-41.8
## Mean : 0.08497 Mean :93.58 Mean :-40.5
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. : 1.40000 Max. :94.77 Max. :-26.9
##
## euribor3m nr.employed y
## Min. :0.635 Min. :4964 Min. :0.0000
## 1st Qu.:1.334 1st Qu.:5099 1st Qu.:0.0000
## Median :4.857 Median :5191 Median :0.0000
## Mean :3.621 Mean :5166 Mean :0.1095
## 3rd Qu.:4.961 3rd Qu.:5228 3rd Qu.:0.0000
## Max. :5.045 Max. :5228 Max. :1.0000
##
# Total number of missing values
sum(is.na(bank))
## [1] 0
##
# Distribution of the binary response
table(bank$y)
## 0 1
## 3668 451
# Set seed for reproducibility
set.seed(1)
# Split the data into Training and Test sets
inTrain <- createDataPartition(bank$y, p = 0.8, list = FALSE)
bank_train <- bank[inTrain, ]
bank_test <- bank[-inTrain, ]
# Null model
log.model.null <- glm(y ~ 1, data = bank_train, family = "binomial")
# Full logistic regression model
log.model.full <- glm(y ~ ., data = bank_train, family = "binomial")
summary(log.model.full)
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = bank_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9448 -0.3964 -0.3156 -0.2456 2.8720
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.515e+02 1.136e+02 -1.334 0.182208
## age 1.788e-02 7.612e-03 2.349 0.018804 *
## jobblue-collar -2.756e-01 2.499e-01 -1.103 0.270111
## jobentrepreneur -6.908e-01 4.583e-01 -1.507 0.131717
## jobhousemaid -2.236e-01 4.565e-01 -0.490 0.624289
## jobmanagement -5.767e-01 2.775e-01 -2.078 0.037685 *
## jobretired -1.944e-01 3.341e-01 -0.582 0.560620
## jobself-employed -7.595e-01 4.001e-01 -1.898 0.057647 .
## jobservices -2.237e-01 2.754e-01 -0.812 0.416589
## jobstudent 1.203e-01 3.826e-01 0.314 0.753278
## jobtechnician -5.621e-02 2.131e-01 -0.264 0.792001
## jobunemployed -7.829e-02 3.703e-01 -0.211 0.832583
## jobunknown -1.625e-01 6.486e-01 -0.250 0.802233
## maritalmarried 3.365e-02 2.182e-01 0.154 0.877422
## maritalsingle 1.147e-01 2.536e-01 0.452 0.651045
## maritalunknown -4.919e-03 1.197e+00 -0.004 0.996720
## educationbasic.6y 1.672e-01 3.737e-01 0.447 0.654654
## educationbasic.9y 2.308e-01 2.920e-01 0.790 0.429430
## educationhigh.school -2.213e-02 2.868e-01 -0.077 0.938486
## educationilliterate -1.235e+01 5.354e+02 -0.023 0.981600
## educationprofessional.course 5.175e-02 3.129e-01 0.165 0.868633
## educationuniversity.degree 2.082e-01 2.845e-01 0.732 0.464302
## educationunknown 1.200e-01 3.800e-01 0.316 0.752268
## defaultunknown 2.480e-02 1.971e-01 0.126 0.899877
## defaultyes -1.013e+01 5.354e+02 -0.019 0.984898
## housingunknown -2.577e-01 4.405e-01 -0.585 0.558572
## housingyes -9.526e-02 1.309e-01 -0.728 0.466633
## loanunknown NA NA NA NA
## loanyes -1.183e-01 1.790e-01 -0.661 0.508744
## contacttelephone -8.928e-01 2.618e-01 -3.411 0.000648 ***
## monthaug -7.220e-02 3.927e-01 -0.184 0.854106
## monthdec 6.675e-01 6.107e-01 1.093 0.274432
## monthjul -1.160e-01 3.367e-01 -0.344 0.730557
## monthjun 1.707e-01 4.029e-01 0.424 0.671819
## monthmar 1.740e+00 4.972e-01 3.499 0.000467 ***
## monthmay -4.285e-01 2.790e-01 -1.536 0.124614
## monthnov -5.046e-01 3.949e-01 -1.278 0.201300
## monthoct -2.757e-01 4.978e-01 -0.554 0.579661
## monthsep -2.212e-01 5.639e-01 -0.392 0.694841
## day_of_weekmon 1.131e-01 2.062e-01 0.548 0.583501
## day_of_weekthu 1.544e-01 2.096e-01 0.737 0.461389
## day_of_weektue 1.380e-01 2.093e-01 0.659 0.509837
## day_of_weekwed 3.017e-01 2.143e-01 1.408 0.159169
## campaign -6.649e-02 3.825e-02 -1.738 0.082159 .
## pdays -3.853e-04 6.419e-04 -0.600 0.548294
## previous 1.532e-01 1.705e-01 0.898 0.368932
## poutcomenonexistent 4.941e-01 2.867e-01 1.723 0.084827 .
## poutcomesuccess 1.175e+00 6.380e-01 1.841 0.065561 .
## emp.var.rate -8.374e-01 4.303e-01 -1.946 0.051655 .
## cons.price.idx 1.461e+00 7.507e-01 1.946 0.051715 .
## cons.conf.idx 5.473e-02 2.572e-02 2.128 0.033364 *
## euribor3m -6.369e-02 4.016e-01 -0.159 0.874014
## nr.employed 2.818e-03 9.212e-03 0.306 0.759658
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2323.4 on 3295 degrees of freedom
## Residual deviance: 1779.7 on 3244 degrees of freedom
## AIC: 1883.7
##
## Number of Fisher Scoring iterations: 12
# Stepwise model selection with AIC criteria
glm.aic <- step(log.model.null, scope = list(upper = log.model.full),
direction = "both", test = "Chisq", trace = F)
summary(glm.aic)
##
## Call:
## glm(formula = y ~ nr.employed + month + poutcome + contact +
## cons.conf.idx + age + campaign, family = "binomial", data = bank_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0084 -0.3925 -0.3247 -0.2505 2.6965
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 54.971712 5.086656 10.807 < 2e-16 ***
## nr.employed -0.010909 0.001008 -10.817 < 2e-16 ***
## monthaug -0.197300 0.320996 -0.615 0.53879
## monthdec 0.405214 0.551729 0.734 0.46268
## monthjul 0.170190 0.293655 0.580 0.56221
## monthjun 0.834335 0.285462 2.923 0.00347 **
## monthmar 1.368456 0.417144 3.281 0.00104 **
## monthmay -0.550316 0.250787 -2.194 0.02821 *
## monthnov -0.440300 0.308987 -1.425 0.15416
## monthoct -0.355389 0.405363 -0.877 0.38064
## monthsep -0.584459 0.413824 -1.412 0.15785
## poutcomenonexistent 0.240466 0.185225 1.298 0.19421
## poutcomesuccess 1.583589 0.260890 6.070 1.28e-09 ***
## contacttelephone -0.596318 0.193693 -3.079 0.00208 **
## cons.conf.idx 0.034335 0.014910 2.303 0.02129 *
## age 0.012431 0.005273 2.358 0.01839 *
## campaign -0.070731 0.038236 -1.850 0.06433 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2323.4 on 3295 degrees of freedom
## Residual deviance: 1802.8 on 3279 degrees of freedom
## AIC: 1836.8
##
## Number of Fisher Scoring iterations: 6
# Stepwise model selection with BIC criteria
glm.bic <- step(log.model.null, scope = list(upper = log.model.full),
direction="both", test="Chisq", trace = F, k=log(nrow(bank_train)))
summary(glm.bic)
##
## Call:
## glm(formula = y ~ nr.employed + pdays, family = "binomial", data = bank_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7171 -0.5619 -0.3366 -0.2824 2.5463
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 56.4726824 3.9295590 14.371 < 2e-16 ***
## nr.employed -0.0111319 0.0007806 -14.261 < 2e-16 ***
## pdays -0.0014777 0.0002094 -7.058 1.69e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2323.4 on 3295 degrees of freedom
## Residual deviance: 1899.8 on 3293 degrees of freedom
## AIC: 1905.8
##
## Number of Fisher Scoring iterations: 5
pred.aic <- prediction(predict(glm.aic, bank_train, type = "response"), bank_train$y)
# Compute the cutoff that best trades off sensitivity and specificity
plot(unlist(performance(pred.aic, "sens")@x.values),
     unlist(performance(pred.aic, "sens")@y.values),
     type = "l", lwd = 2, ylab = "", xlab = "Cutoff")
mtext("Sensitivity", side = 2)
mtext("Sensitivity vs. Specificity Plot for AIC Model", side = 3)
# Overlay specificity in the same plot
par(new = TRUE)
plot(unlist(performance(pred.aic, "spec")@x.values),
     unlist(performance(pred.aic, "spec")@y.values),
     type = "l", lwd = 2, col = "red", ylab = "", xlab = "Cutoff")
axis(4, at = seq(0, 1, 0.2))
mtext("Specificity", side = 4, col = "red")
# Find the cutoff where the sensitivity and specificity curves intersect
min.diff.aic <- which.min(abs(unlist(performance(pred.aic, "sens")@y.values) -
                              unlist(performance(pred.aic, "spec")@y.values)))
min.x.aic <- unlist(performance(pred.aic, "sens")@x.values)[min.diff.aic]
min.y.aic <- unlist(performance(pred.aic, "spec")@y.values)[min.diff.aic]
optimal.aic <- min.x.aic
optimal.aic
## 549
## 0.07573385
pred.bic <- prediction(predict(glm.bic, bank_train, type = "response"), bank_train$y)
# Compute the cutoff that best trades off sensitivity and specificity
plot(unlist(performance(pred.bic, "sens")@x.values),
     unlist(performance(pred.bic, "sens")@y.values),
     type = "l", lwd = 2, ylab = "", xlab = "Cutoff")
mtext("Sensitivity", side = 2)
mtext("Sensitivity vs. Specificity Plot for BIC Model", side = 3)
# Overlay specificity in the same plot
par(new = TRUE)
plot(unlist(performance(pred.bic, "spec")@x.values),
     unlist(performance(pred.bic, "spec")@y.values),
     type = "l", lwd = 2, col = "red", ylab = "", xlab = "Cutoff")
axis(4, at = seq(0, 1, 0.2))
mtext("Specificity", side = 4, col = "red")
# Find the cutoff where the sensitivity and specificity curves intersect
min.diff.bic <- which.min(abs(unlist(performance(pred.bic, "sens")@y.values) -
                              unlist(performance(pred.bic, "spec")@y.values)))
min.x.bic <- unlist(performance(pred.bic, "sens")@x.values)[min.diff.bic]
min.y.bic <- unlist(performance(pred.bic, "spec")@y.values)[min.diff.bic]
optimal.bic <- min.x.bic
optimal.bic
## 4117
## 0.1460365
# Predicted probabilities for AIC model
prob.aic <- predict.glm(glm.aic, newdata = bank_test, type="response")
# Convert AIC probabilities to binary
pred.aic <- ifelse(prob.aic >= optimal.aic, 1, 0)
# Check the accuracy of AIC model
confusionMatrix(as.factor(pred.aic), as.factor(bank_test$y), positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 574 31
## 1 170 48
##
## Accuracy : 0.7558
## 95% CI : (0.7249, 0.7848)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2122
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.60759
## Specificity : 0.77151
## Pos Pred Value : 0.22018
## Neg Pred Value : 0.94876
## Prevalence : 0.09599
## Detection Rate : 0.05832
## Detection Prevalence : 0.26488
## Balanced Accuracy : 0.68955
##
## 'Positive' Class : 1
##
# Predicted probabilities for BIC model
prob.bic <- predict.glm(glm.bic, newdata = bank_test, type="response")
# Convert BIC probabilities to binary
pred.bic <- ifelse(prob.bic >= optimal.bic, 1, 0)
# Check the accuracy of BIC model
confusionMatrix(as.factor(pred.bic), as.factor(bank_test$y), positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 555 30
## 1 189 49
##
## Accuracy : 0.7339
## 95% CI : (0.7023, 0.7638)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1928
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.62025
## Specificity : 0.74597
## Pos Pred Value : 0.20588
## Neg Pred Value : 0.94872
## Prevalence : 0.09599
## Detection Rate : 0.05954
## Detection Prevalence : 0.28919
## Balanced Accuracy : 0.68311
##
## 'Positive' Class : 1
##