To be profitable, a business must reduce costs and target the right customers. This holds true for any organization, and with the growth of data and analytics, the ability to create targeted marketing campaigns is a potentially lucrative investment for any organization interested in making the most of the stores of data at its fingertips.
The purpose of this case study is to highlight the importance of strategic, data-driven predictive models and to predict whether a client will subscribe to a term deposit with a Portuguese bank. Data for this case study comes from the UCI Machine Learning Repository; it contains information from the Portuguese bank's telemarketing campaigns and can be obtained at http://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The dataset will be used to explore questions such as: What are the best predictors of whether or not a client subscribed to a term deposit? How accurate is the model at predicting our target variable? What are the main drivers for determining whether a client will or won't subscribe to a term deposit?
A logistic model resulted in the highest accuracy and could best predict whether or not a client would subscribe to a term deposit with the bank. The model predicts if a customer will purchase a term deposit product based on: the contact type, the month of contact, the call duration, the number of contacts made during the campaign, the number of contacts before the campaign, and the outcome of the previous marketing campaign.
As we delve into the data, determine the best model for this use case, and uncover the appropriate levers to pull, we will be better able to translate the results into data-driven insights and actions the business can take directly to guide its marketing campaign.
This will be accomplished through data investigation, data cleaning, model building, and interpretation. The insights uncovered will rely on the appropriate interpretation of the model that most accurately determines whether or not a client will subscribe to a term deposit. The Portuguese Bank Marketing dataset will be used to train and test various predictive models in R. To best serve the bank, the models built and examined will be evaluated for accuracy, consistency, and reliability.
The best model will be identified through interpretation of model statistics such as accuracy, AIC, p-values, and beta coefficients. To best guide and inform the Portuguese bank, the outputs will be closely examined to determine the best predictors of our target variable and to provide recommendations for a more targeted marketing campaign that yields the best results and drives the most profit. To provide actionable insights, the data pre-processing, cleaning, and eventual model interpretation will rely on best practices and established techniques for predictive model building.
In this case study, a logistic regression model and an LDA model will be trained to predict whether a customer will purchase a term deposit product. The predicted outcome variable is a binary yes/no indicating whether someone is a respondent. The dataset used for this analysis relates to the direct marketing campaigns of a Portuguese bank and is a randomly selected ten percent sample of the full dataset. It contains 4119 observations and 21 variables: a binary output variable y and 20 input variables, of which 10 are continuous and 10 are categorical. Each categorical variable with n categories requires n-1 dummy variables in order to be used in the model, as illustrated below. Significant variables will be found using backward selection; the selected variables will then be analyzed to confirm their significance in the model.
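As a minimal illustration of this dummy-variable encoding (R's glm() performs the expansion automatically when given factors; BD2 is the cleaned data frame used in the appendix):

# Illustration: a factor with n levels expands to n-1 dummy columns,
# with the first level as the reference category. The 3-level marital
# variable yields maritalmarried and maritalsingle; "divorced" is the baseline.
head(model.matrix(~ marital, data = BD2))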
Plots will be used to identify potentially significant input variables, possible interactions or multicollinearity between variables, and missing or outlier observations. Outliers and missing data points will be omitted or handled as specified in the Data section below. To train and test the model, the data will be divided into training and testing sets using an 80/20 split, resulting in a training set of 2907 observations and a test set of 726 observations.
As with other parametric methods, logistic models have several assumptions that must be satisfied to ensure an accurate model is built. These assumptions are:

- The outcome variable is binary.
- The observations are independent of one another.
- The continuous predictors are linearly related to the log-odds of the outcome.
- There is little or no multicollinearity among the predictors.
- The sample size is large relative to the number of predictors.
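One way to screen for multicollinearity is with variance inflation factors. The sketch below assumes the car package is available and uses the full.model fit shown in the appendix:

# Sketch: generalized variance inflation factors for the fitted model.
# GVIF^(1/(2*Df)) values well below ~2 suggest multicollinearity is not
# a serious concern among these predictors.
library(car)
vif(full.model)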
Additionally, logistic models may be over-fit on the training data, particularly when working with high-dimensional datasets with relatively few observations. This can result in a high predicted accuracy with the training data, but poor accuracy when applied to real-world data. Due to the large number of observations in the bank dataset, over-fitting is not likely to be a concern in this analysis, but will still be checked for.
The dataset consisted of 4119 observations and 21 total variables, including the response variable y. Initial analysis showed the data was fairly clean: while no variable had missing values, many of the categorical variables contained multiple "unknown" responses. These unknown values were examined on a case-by-case basis and removed when the number of observations in the category was comparatively low. Along with the unknown values, other opportunities to clean the variables were found prior to training: there was a single "illiterate" response in the education variable, and relatively few observations where age was below 25 or above 60. The data cleaning and pre-processing steps were:

- Remove the sparse "unknown" responses from job, marital, education, housing, and loan (the "unknown" level of default was retained, as it was comparatively common).
- Remove the rare "yes" responses in default.
- Remove the single "illiterate" response in education.
- Remove observations with age below 25 or above 60.
Finally, the categorical variables were re-factored using the cleaned dataset so that the dropped levels were removed. A total of 486 observations were excluded, leaving our "clean" dataset with 3633 observations.
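The cleaning code itself is not reproduced in the appendix; a hypothetical reconstruction of the steps described above, assuming the raw data are loaded as BankData, might look like the following:

# Hypothetical reconstruction of the cleaning steps (exact filters are
# assumptions). Sparse "unknown" levels, the rare default = "yes" rows,
# the single "illiterate" response, and the age tails are dropped; the
# common "unknown" level of default is kept.
BD2 <- subset(BankData,
              job != "unknown" & marital != "unknown" &
              education != "unknown" & education != "illiterate" &
              housing != "unknown" & loan != "unknown" &
              default != "yes" &
              age >= 25 & age <= 60)
BD2 <- droplevels(BD2)  # remove the now-empty factor levels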
None of the social and economic variables were included in the dataset that the model was built from. This was done to keep the focus of the model on the potential customer, and it was found that an accurate model could be trained without these additional attributes.
Our analysis resulted in a logistic model having the highest accuracy in predicting the target variable. The final model has a relatively high test accuracy of 92.98% and the following formula, where p is the predicted probability of subscribing (coefficients are on the log-odds scale): \(\log(p/(1-p)) = -3.71 - 1.72*contacttelephone - 0.18*monthaug + 1.39*monthdec - 0.76*monthjul + 1.52*monthjun + 2.52*monthmar - 0.20*monthmay - 0.75*monthnov + 1.82*monthoct + 1.38*monthsep + 0.0049*duration - 0.11*campaign + 0.37*previous + 0.16*poutcomenonexistent + 2.48*poutcomesuccess\)
This can be interpreted as follows: each coefficient is the change in the log-odds of subscribing associated with a one-unit increase in its predictor, holding the others fixed. For example, each additional second of call duration adds 0.0049 to the log-odds, being contacted by telephone rather than cellular lowers the log-odds by 1.72, and a successful outcome in a previous campaign raises the log-odds by 2.48.
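Because the coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are often easier to communicate. A minimal sketch using the final model (final) fit in the appendix:

# Each exponentiated coefficient multiplies the odds of subscribing.
# For example, exp(2.48) is roughly 11.9: a successful previous campaign
# outcome multiplies the odds of subscription by about 12.
round(exp(coef(final)), 3)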
The confusion matrix for this model on the test data shows a high accuracy rate; the model is particularly good at predicting non-responders, i.e., those who would not purchase a term deposit product.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 643 9
## 1 42 32
##
## Accuracy : 0.9298
## 95% CI : (0.9087, 0.9473)
## No Information Rate : 0.9435
## P-Value [Acc > NIR] : 0.9505
##
## Kappa : 0.5218
##
## Mcnemar's Test P-Value : 7.433e-06
##
## Sensitivity : 0.9387
## Specificity : 0.7805
## Pos Pred Value : 0.9862
## Neg Pred Value : 0.4324
## Prevalence : 0.9435
## Detection Rate : 0.8857
## Detection Prevalence : 0.8981
## Balanced Accuracy : 0.8596
##
## 'Positive' Class : 0
##
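For reference, the overall accuracy follows directly from the matrix: (643 + 32) / 726 ≈ 0.9298, matching the reported value.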
An LDA model was also fit, but the logistic model was found to have a higher test accuracy.
Based on the residual plots below, the residuals appear to be approximately normally distributed. Additionally, looking at Cook's distance, while some observations have a relatively large distance, none appear to be extreme outliers.
The model initially built using backward selection included all of the variables in the final model outlined above, plus the marital variable. However, further analysis of the p-values in the model output showed that marital was insignificant; removing it improved test accuracy from 92.42% to 92.98%.
Before determining that the preferred model would be built through logistic regression, an LDA model was trained and tested on the same pre-processed dataset with the same identified variables. The LDA model performed noticeably worse than the logistic model, with a test accuracy of 90.5%. Given that even a small improvement in accuracy could translate into a significant increase in profits, we chose the logistic model over the LDA.
In conclusion, we recommend using the logistic model described above to predict whether a client will respond to a direct marketing campaign and purchase a term deposit. As mentioned above, this model had the greatest predictive power of the models tested and is particularly good at predicting non-responders. This allows greater focus on the customers who are most likely to purchase a product, saving both time and money.
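In practice, the bank could score its contact list with the model and focus outreach on the clients with the highest predicted probabilities. A sketch, where prospects is a hypothetical data frame containing the same input columns as the training data:

# Rank prospective clients by predicted probability of subscribing so the
# campaign can prioritize likely responders. 'prospects' is hypothetical
# and must contain the final model's input columns.
prospects$PredProb <- predict(final, newdata = prospects, type = "response")
ranked <- prospects[order(-prospects$PredProb), ]
head(ranked)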
Despite the efforts made to maximize the predictive power of the chosen model, there are opportunities to improve upon our model, methodology, and data. One would be to better understand the specifics of the marketing campaign so that the analysis could be more tailored and focused. Additional analysis into the factors that lead a customer to decline a term deposit in one campaign but purchase one in the next would also be useful. The model shows that the number of times a customer was contacted in a previous campaign affects their decision in the current campaign; a better understanding of the factors behind that decision could yield further insights and profitable recommendations for the business.
Moro, S., Cortez, P., & Rita, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62, 22–31.
Babakus, E., Eroglu, S., & Yavas, U. (2004). Modeling consumers' choice behavior: An application in banking. Journal of Services Marketing, 18(6), 462–470. https://doi.org/10.1108/08876040410557249
# Packages used throughout the analysis
library(caret)    # createDataPartition, confusionMatrix
library(MASS)     # lda
library(ggplot2)  # exploratory plots

# Convert the categorical inputs to factors
factor_cols <- c("job", "marital", "education", "default", "housing",
                 "loan", "contact", "month", "day_of_week", "poutcome")
BD2[factor_cols] <- lapply(BD2[factor_cols], factor)
str(BD2)
## 'data.frame': 3633 obs. of 21 variables:
## $ age : int 30 39 25 47 32 32 41 31 35 25 ...
## $ job : Factor w/ 11 levels "admin.","blue-collar",..: 2 8 8 1 8 1 3 8 2 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 3 2 1 2 3 ...
## $ education : Factor w/ 6 levels "basic.4y","basic.6y",..: 3 4 4 6 6 6 6 5 3 2 ...
## $ default : Factor w/ 2 levels "no","unknown": 1 1 1 1 1 1 2 1 2 2 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 1 1 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 1 2 2 1 1 1 1 1 2 1 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 8 10 10 8 8 7 4 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 2 3 2 2 4 3 3 ...
## $ duration : int 487 346 227 58 128 290 44 68 170 301 ...
## $ campaign : int 2 4 1 1 3 4 2 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 2 0 0 1 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 1 2 2 1 2 2 ...
## $ emp.var.rate : num -1.8 1.1 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 1.4 ...
## $ cons.price.idx: num 92.9 94 94.5 93.2 94.2 ...
## $ cons.conf.idx : num -46.2 -36.4 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 -42.7 ...
## $ euribor3m : num 1.313 4.855 4.962 4.191 0.884 ...
## $ nr.employed : num 5099 5191 5228 5196 4964 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# Keep the client-level variables (columns 1-15) and the response y,
# dropping the social and economic attributes
BD2 = BD2[c(1:15, 21)]
str(BD2)
## 'data.frame': 3633 obs. of 16 variables:
## $ age : int 30 39 25 47 32 32 41 31 35 25 ...
## $ job : Factor w/ 11 levels "admin.","blue-collar",..: 2 8 8 1 8 1 3 8 2 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 3 2 1 2 3 ...
## $ education : Factor w/ 6 levels "basic.4y","basic.6y",..: 3 4 4 6 6 6 6 5 3 2 ...
## $ default : Factor w/ 2 levels "no","unknown": 1 1 1 1 1 1 2 1 2 2 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 1 1 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 1 2 2 1 1 1 1 1 2 1 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 8 10 10 8 8 7 4 ...
## $ day_of_week: Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 2 3 2 2 4 3 3 ...
## $ duration : int 487 346 227 58 128 290 44 68 170 301 ...
## $ campaign : int 2 4 1 1 3 4 2 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 2 0 0 1 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 1 2 2 1 2 2 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# Reproducible, stratified 80/20 train/test split
set.seed(1)
inTrain <- createDataPartition(BD2$y, p = 0.8, list = FALSE)
BankTrain1 = BD2[inTrain,]
BankTest1 = BD2[-inTrain,]
# Recode the response: 1 = subscribed ("yes"), 0 = did not subscribe
BankTrain1$y <- ifelse(BankTrain1$y == 'yes', 1, 0)
BankTest1$y <- ifelse(BankTest1$y == 'yes', 1, 0)
# Full logistic model using all remaining predictors
full.model = glm(formula = y ~ ., data = BankTrain1, family = binomial)
summary(full.model)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = BankTrain1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9836 -0.3203 -0.2161 -0.1245 2.8379
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2688353 1.2259843 -1.851 0.064224 .
## age -0.0012974 0.0106139 -0.122 0.902713
## jobblue-collar -0.2149344 0.3238015 -0.664 0.506828
## jobentrepreneur -0.8631008 0.5910161 -1.460 0.144189
## jobhousemaid 0.2238285 0.5414284 0.413 0.679311
## jobmanagement -0.3002416 0.3322662 -0.904 0.366198
## jobretired -0.0008408 0.4829278 -0.002 0.998611
## jobself-employed -0.4818882 0.4575717 -1.053 0.292276
## jobservices 0.2141980 0.3278194 0.653 0.513497
## jobstudent 1.3787048 0.5677547 2.428 0.015168 *
## jobtechnician 0.2451205 0.2569624 0.954 0.340126
## jobunemployed 0.3751666 0.4450409 0.843 0.399232
## maritalmarried -0.1729109 0.2676393 -0.646 0.518241
## maritalsingle -0.0227413 0.3031660 -0.075 0.940205
## educationbasic.6y -0.4467838 0.5077688 -0.880 0.378916
## educationbasic.9y -0.2452010 0.3763144 -0.652 0.514669
## educationhigh.school -0.2523120 0.3703567 -0.681 0.495702
## educationprofessional.course -0.2177957 0.4071596 -0.535 0.592709
## educationuniversity.degree 0.1740951 0.3751156 0.464 0.642568
## defaultunknown -0.0014664 0.2453144 -0.006 0.995231
## housingyes -0.1098561 0.1611457 -0.682 0.495416
## loanyes -0.1526658 0.2176320 -0.701 0.483000
## contacttelephone -1.6557381 0.2528481 -6.548 5.82e-11 ***
## monthaug -0.3100136 0.3701168 -0.838 0.402250
## monthdec 1.2927506 0.7904988 1.635 0.101973
## monthjul -0.8140762 0.3836824 -2.122 0.033859 *
## monthjun 1.4072900 0.3968341 3.546 0.000391 ***
## monthmar 2.2971836 0.5441679 4.221 2.43e-05 ***
## monthmay -0.1598098 0.3517958 -0.454 0.649635
## monthnov -0.7872673 0.3969035 -1.984 0.047309 *
## monthoct 1.5265833 0.5480786 2.785 0.005347 **
## monthsep 1.1295274 0.5613008 2.012 0.044184 *
## day_of_weekmon -0.0246007 0.2453536 -0.100 0.920133
## day_of_weekthu -0.1613287 0.2453545 -0.658 0.510838
## day_of_weektue -0.1615139 0.2524773 -0.640 0.522357
## day_of_weekwed -0.2006646 0.2623784 -0.765 0.444396
## duration 0.0049794 0.0002893 17.210 < 2e-16 ***
## campaign -0.1207924 0.0497010 -2.430 0.015083 *
## pdays -0.0009569 0.0009024 -1.060 0.288969
## previous 0.2765877 0.2186057 1.265 0.205787
## poutcomenonexistent 0.1167579 0.3660234 0.319 0.749734
## poutcomesuccess 1.6406991 0.8806517 1.863 0.062455 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1913.2 on 2906 degrees of freedom
## Residual deviance: 1168.3 on 2865 degrees of freedom
## AIC: 1252.3
##
## Number of Fisher Scoring iterations: 6
formula(full.model)
## y ~ age + job + marital + education + default + housing + loan +
## contact + month + day_of_week + duration + campaign + pdays +
## previous + poutcome
coef(full.model)
## (Intercept) age
## -2.2688353294 -0.0012973835
## jobblue-collar jobentrepreneur
## -0.2149343636 -0.8631007616
## jobhousemaid jobmanagement
## 0.2238284810 -0.3002416362
## jobretired jobself-employed
## -0.0008407974 -0.4818882313
## jobservices jobstudent
## 0.2141979987 1.3787048425
## jobtechnician jobunemployed
## 0.2451204821 0.3751666080
## maritalmarried maritalsingle
## -0.1729109304 -0.0227413151
## educationbasic.6y educationbasic.9y
## -0.4467838136 -0.2452009743
## educationhigh.school educationprofessional.course
## -0.2523120303 -0.2177957128
## educationuniversity.degree defaultunknown
## 0.1740951269 -0.0014663677
## housingyes loanyes
## -0.1098561292 -0.1526658257
## contacttelephone monthaug
## -1.6557381465 -0.3100135548
## monthdec monthjul
## 1.2927505652 -0.8140761735
## monthjun monthmar
## 1.4072899992 2.2971836372
## monthmay monthnov
## -0.1598098032 -0.7872672748
## monthoct monthsep
## 1.5265832986 1.1295273823
## day_of_weekmon day_of_weekthu
## -0.0246006730 -0.1613287415
## day_of_weektue day_of_weekwed
## -0.1615139186 -0.2006645811
## duration campaign
## 0.0049794469 -0.1207923723
## pdays previous
## -0.0009569390 0.2765876676
## poutcomenonexistent poutcomesuccess
## 0.1167578724 1.6406990552
# Training-set predictions: classify as a responder when the predicted
# probability is at least 0.5
BankTrain1$PredProb = predict.glm(full.model, newdata = BankTrain1, type = 'response')
BankTrain1$PredChoice = ifelse(BankTrain1$PredProb >= .5, 1, 0)
caret::confusionMatrix(as.factor(BankTrain1$y), as.factor(BankTrain1$PredChoice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2558 53
## 1 187 109
##
## Accuracy : 0.9174
## 95% CI : (0.9068, 0.9272)
## No Information Rate : 0.9443
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4353
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9319
## Specificity : 0.6728
## Pos Pred Value : 0.9797
## Neg Pred Value : 0.3682
## Prevalence : 0.9443
## Detection Rate : 0.8799
## Detection Prevalence : 0.8982
## Balanced Accuracy : 0.8024
##
## 'Positive' Class : 0
##
full.model.train.acc = .9174
# Test-set predictions for the full model
BankTest1$PredProb = predict.glm(full.model, newdata = BankTest1, type = 'response')
BankTest1$PredChoice = ifelse(BankTest1$PredProb >= .5, 1, 0)
caret::confusionMatrix(as.factor(BankTest1$y), as.factor(BankTest1$PredChoice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 641 11
## 1 44 30
##
## Accuracy : 0.9242
## 95% CI : (0.9025, 0.9424)
## No Information Rate : 0.9435
## P-Value [Acc > NIR] : 0.9875
##
## Kappa : 0.4843
##
## Mcnemar's Test P-Value : 1.597e-05
##
## Sensitivity : 0.9358
## Specificity : 0.7317
## Pos Pred Value : 0.9831
## Neg Pred Value : 0.4054
## Prevalence : 0.9435
## Detection Rate : 0.8829
## Detection Prevalence : 0.8981
## Balanced Accuracy : 0.8337
##
## 'Positive' Class : 0
##
full.model.test.acc = .9242
# Backward selection from the full model by stepwise AIC
set.seed(1)
back.model = step(full.model, trace = 0)
summary(back.model)
##
## Call:
## glm(formula = y ~ marital + contact + month + duration + campaign +
## previous + poutcome, family = binomial, data = BankTrain1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9756 -0.3232 -0.2250 -0.1317 2.9492
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6160581 0.4898341 -7.382 1.56e-13 ***
## maritalmarried -0.2204167 0.2627254 -0.839 0.401490
## maritalsingle 0.1506302 0.2753804 0.547 0.584386
## contacttelephone -1.6660945 0.2471946 -6.740 1.58e-11 ***
## monthaug -0.2351446 0.3573054 -0.658 0.510470
## monthdec 1.3693569 0.7650077 1.790 0.073455 .
## monthjul -0.8117587 0.3708845 -2.189 0.028618 *
## monthjun 1.4405179 0.3840940 3.750 0.000177 ***
## monthmar 2.3857575 0.5200904 4.587 4.49e-06 ***
## monthmay -0.2308838 0.3375313 -0.684 0.493952
## monthnov -0.7730728 0.3845480 -2.010 0.044395 *
## monthoct 1.6974745 0.5312758 3.195 0.001398 **
## monthsep 1.2910765 0.5488018 2.353 0.018646 *
## duration 0.0048860 0.0002816 17.351 < 2e-16 ***
## campaign -0.1089008 0.0482991 -2.255 0.024151 *
## previous 0.3676629 0.2015894 1.824 0.068179 .
## poutcomenonexistent 0.1776117 0.3582398 0.496 0.620043
## poutcomesuccess 2.5112497 0.3427697 7.326 2.37e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1913.2 on 2906 degrees of freedom
## Residual deviance: 1191.1 on 2889 degrees of freedom
## AIC: 1227.1
##
## Number of Fisher Scoring iterations: 6
formula(back.model)
## y ~ marital + contact + month + duration + campaign + previous +
## poutcome
coef(back.model)
## (Intercept) maritalmarried maritalsingle
## -3.616058085 -0.220416692 0.150630250
## contacttelephone monthaug monthdec
## -1.666094465 -0.235144556 1.369356855
## monthjul monthjun monthmar
## -0.811758739 1.440517945 2.385757489
## monthmay monthnov monthoct
## -0.230883834 -0.773072779 1.697474470
## monthsep duration campaign
## 1.291076476 0.004886002 -0.108900812
## previous poutcomenonexistent poutcomesuccess
## 0.367662878 0.177611705 2.511249711
# Fresh copy of the test set for evaluating the backward-selected model
BankTest2 = BD2[-inTrain,]
BankTest2$y <- ifelse(BankTest2$y == 'yes', 1, 0)
back.model.test.acc = .9242  # test accuracy, from the confusion matrix below
BankTest2$PredProb = predict.glm(back.model, newdata = BankTest2, type = 'response')
BankTest2$PredChoice = ifelse(BankTest2$PredProb >= .5, 1, 0)
caret::confusionMatrix(as.factor(BankTest2$y), as.factor(BankTest2$PredChoice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 642 10
## 1 45 29
##
## Accuracy : 0.9242
## 95% CI : (0.9025, 0.9424)
## No Information Rate : 0.9463
## P-Value [Acc > NIR] : 0.9951
##
## Kappa : 0.4764
##
## Mcnemar's Test P-Value : 4.549e-06
##
## Sensitivity : 0.9345
## Specificity : 0.7436
## Pos Pred Value : 0.9847
## Neg Pred Value : 0.3919
## Prevalence : 0.9463
## Detection Rate : 0.8843
## Detection Prevalence : 0.8981
## Balanced Accuracy : 0.8390
##
## 'Positive' Class : 0
##
# Final model: the backward-selected model with the insignificant
# marital variable removed
BankTest3 = BD2[-inTrain,]
BankTest3$y <- ifelse(BankTest3$y == 'yes', 1, 0)
final = glm(formula = y ~ contact + month + duration +
              campaign + previous + poutcome,
            data = BankTrain1, family = binomial)
summary(final)
##
## Call:
## glm(formula = y ~ contact + month + duration + campaign + previous +
## poutcome, family = binomial, data = BankTrain1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9575 -0.3211 -0.2291 -0.1301 2.8946
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.7144288 0.4393945 -8.454 < 2e-16 ***
## contacttelephone -1.7156467 0.2464950 -6.960 3.40e-12 ***
## monthaug -0.1841470 0.3570103 -0.516 0.60599
## monthdec 1.3889177 0.7669722 1.811 0.07015 .
## monthjul -0.7576417 0.3701062 -2.047 0.04065 *
## monthjun 1.5213752 0.3838998 3.963 7.40e-05 ***
## monthmar 2.5167602 0.5157246 4.880 1.06e-06 ***
## monthmay -0.1969572 0.3376368 -0.583 0.55966
## monthnov -0.7495801 0.3847047 -1.948 0.05136 .
## monthoct 1.8201343 0.5252828 3.465 0.00053 ***
## monthsep 1.3840190 0.5448186 2.540 0.01107 *
## duration 0.0048717 0.0002812 17.324 < 2e-16 ***
## campaign -0.1075191 0.0480860 -2.236 0.02535 *
## previous 0.3743971 0.1994030 1.878 0.06044 .
## poutcomenonexistent 0.1584876 0.3565704 0.444 0.65670
## poutcomesuccess 2.4771524 0.3416229 7.251 4.13e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1913.2 on 2906 degrees of freedom
## Residual deviance: 1195.7 on 2891 degrees of freedom
## AIC: 1227.7
##
## Number of Fisher Scoring iterations: 6
# Test-set predictions for the final model
BankTest3$PredProb = predict.glm(final, newdata = BankTest3, type = 'response')
BankTest3$PredChoice = ifelse(BankTest3$PredProb >= .5, 1, 0)
caret::confusionMatrix(as.factor(BankTest3$y), as.factor(BankTest3$PredChoice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 643 9
## 1 42 32
##
## Accuracy : 0.9298
## 95% CI : (0.9087, 0.9473)
## No Information Rate : 0.9435
## P-Value [Acc > NIR] : 0.9505
##
## Kappa : 0.5218
##
## Mcnemar's Test P-Value : 7.433e-06
##
## Sensitivity : 0.9387
## Specificity : 0.7805
## Pos Pred Value : 0.9862
## Neg Pred Value : 0.4324
## Prevalence : 0.9435
## Detection Rate : 0.8857
## Detection Prevalence : 0.8981
## Balanced Accuracy : 0.8596
##
## 'Positive' Class : 0
##
# LDA model (MASS::lda) using the demographic and campaign predictors
model.lda = lda(formula = as.factor(y) ~ age + job + marital +
                  education + default + loan + contact + month +
                  day_of_week + poutcome, data = BankTrain1)
# View the output
model.lda
## Call:
## lda(as.factor(y) ~ age + job + marital + education + default +
## loan + contact + month + day_of_week + poutcome, data = BankTrain1)
##
## Prior probabilities of groups:
## 0 1
## 0.8981768 0.1018232
##
## Group means:
## age jobblue-collar jobentrepreneur jobhousemaid jobmanagement
## 0 39.77748 0.2202221 0.03944849 0.02527767 0.07851398
## 1 39.37838 0.1385135 0.02027027 0.01689189 0.06756757
## jobretired jobself-employed jobservices jobstudent jobtechnician
## 0 0.02719265 0.03791651 0.09957871 0.00842589 0.1734967
## 1 0.03716216 0.03378378 0.07770270 0.03378378 0.1959459
## jobunemployed maritalmarried maritalsingle educationbasic.6y
## 0 0.02642666 0.6261969 0.2619686 0.06089621
## 1 0.04054054 0.5270270 0.3614865 0.03040541
## educationbasic.9y educationhigh.school educationprofessional.course
## 0 0.1443891 0.2366909 0.1355802
## 1 0.1081081 0.2195946 0.1385135
## educationuniversity.degree defaultunknown loanyes contacttelephone
## 0 0.3182689 0.2049023 0.1711988 0.3872080
## 1 0.4324324 0.1081081 0.1554054 0.1756757
## monthaug monthdec monthjul monthjun monthmar monthmay monthnov
## 0 0.1635389 0.002680965 0.1719648 0.1260054 0.00536193 0.3554194 0.1122175
## 1 0.1452703 0.020270270 0.1317568 0.1621622 0.05743243 0.2195946 0.1081081
## monthoct monthsep day_of_weekmon day_of_weekthu day_of_weektue
## 0 0.00842589 0.00842589 0.2022214 0.2098813 0.2010724
## 1 0.03716216 0.04729730 0.2094595 0.2128378 0.1993243
## day_of_weekwed poutcomenonexistent poutcomesuccess
## 0 0.2014554 0.8885484 0.01302183
## 1 0.1587838 0.6722973 0.19594595
##
## Coefficients of linear discriminants:
## LD1
## age 0.004743784
## jobblue-collar -0.185631813
## jobentrepreneur -0.273007137
## jobhousemaid -0.106720287
## jobmanagement -0.361416838
## jobretired 0.128865224
## jobself-employed -0.213913023
## jobservices -0.171665896
## jobstudent 1.377439975
## jobtechnician 0.061678678
## jobunemployed -0.036977638
## maritalmarried -0.091722715
## maritalsingle 0.029393543
## educationbasic.6y -0.143832768
## educationbasic.9y -0.041575973
## educationhigh.school -0.079895270
## educationprofessional.course -0.089699435
## educationuniversity.degree 0.142432916
## defaultunknown -0.077533958
## loanyes -0.007371146
## contacttelephone -0.867130702
## monthaug -0.719005010
## monthdec 1.543193662
## monthjul -0.573400104
## monthjun 0.485066255
## monthmar 2.874203490
## monthmay -0.254035976
## monthnov -0.606019413
## monthoct 1.301459115
## monthsep 1.192507712
## day_of_weekmon -0.120561510
## day_of_weekthu -0.140554231
## day_of_weektue -0.098890320
## day_of_weekwed -0.183019461
## poutcomenonexistent 0.002850452
## poutcomesuccess 4.314598659
# Predictions
predictions.lda = predict(model.lda, BankTest3)
# Make confusion matrix for the LDA predictions to compare accuracy
caret::confusionMatrix(as.factor(predictions.lda$class), as.factor(BankTest3$y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 641 58
## 1 11 16
##
## Accuracy : 0.905
## 95% CI : (0.8813, 0.9253)
## No Information Rate : 0.8981
## P-Value [Acc > NIR] : 0.2944
##
## Kappa : 0.2775
##
## Mcnemar's Test P-Value : 3.064e-08
##
## Sensitivity : 0.9831
## Specificity : 0.2162
## Pos Pred Value : 0.9170
## Neg Pred Value : 0.5926
## Prevalence : 0.8981
## Detection Rate : 0.8829
## Detection Prevalence : 0.9628
## Balanced Accuracy : 0.5997
##
## 'Positive' Class : 0
##
#Accuracy = .905
# Diagnostics for the glm trained on the cleaned data
names(final$coefficients)
## [1] "(Intercept)" "contacttelephone" "monthaug"
## [4] "monthdec" "monthjul" "monthjun"
## [7] "monthmar" "monthmay" "monthnov"
## [10] "monthoct" "monthsep" "duration"
## [13] "campaign" "previous" "poutcomenonexistent"
## [16] "poutcomesuccess"
par(mfrow=c(3,2))
#Plot of residuals, want to make sure they are normal
plot(density(resid(final, type='response')),
main = "Residuals")
#Check independence of observations
scatter.smooth(rstandard(final, type='deviance'),
col='gray', main = "Observations")
#Plot of residuals fitted against fitted values of y
scatter.smooth(predict(final, type='response'),
rstandard(final, type='deviance'), col='gray', main="Residuals vs Fitted Values of Y")
#Cook's D
plot(cooks.distance(final), type='h', main = "Cook's D")
#Checking out deviance residuals
plot(density(resid(final, type='deviance')),
main = "Deviance of Residuals")
# Explore the tails of the age distribution in the raw data
nrow(BankData[BankData$age < 10, ])
nrow(BankData[BankData$age < 20, ])
nrow(BankData[BankData$age < 25, ])
nrow(BankData[BankData$age < 30, ])
nrow(BankData[BankData$age > 50, ])
nrow(BankData[BankData$age > 55, ])
nrow(BankData[BankData$age > 60, ])
nrow(BankData[BankData$age > 65, ])
nrow(BankData[BankData$age > 70, ])
ggplot(data = BankData, aes(x = age)) +
geom_histogram(binwidth = 5, col = "white")
#There appears to be very few observations where the customer's age is less than 25 or greater than 60.
# Re-check the age tails in the cleaned data
dim(BD2)
nrow(BD2[BD2$age < 25, ])
nrow(BD2[BD2$age > 60, ])
ggplot(data = BD2, aes(x = age)) + geom_histogram(binwidth = 5, col = "white")
ggplot(BD2) + geom_histogram(aes(x = age), binwidth = 0.1, col = "white") +
facet_grid(y~., scales = "free") + scale_x_log10()
#Job Var
ggplot(data = BD2, aes(x = job)) + geom_bar()
nrow(BD2[BD2$job == 'unknown', ])
ggplot(BD2) + geom_bar(aes(x = job), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Marital Variable
ggplot(data = BD2, aes(x = marital)) + geom_bar()
nrow(BD2[BD2$marital == 'unknown', ])
ggplot(BD2) + geom_bar(aes(x = marital), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Education Variable
ggplot(data = BD2, aes(x = education)) + geom_bar()
nrow(BD2[BD2$education == 'unknown', ])
nrow(BD2[BD2$education == 'illiterate', ])
dim(BD2)
ggplot(BD2) + geom_bar(aes(x = education), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Default Variable
ggplot(data = BD2, aes(x = default)) + geom_bar()
nrow(BD2[BD2$default == 'unknown', ])
nrow(BD2[BD2$default == 'yes', ])
ggplot(BD2) + geom_bar(aes(x = default), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Housing Variable
ggplot(data = BD2, aes(x = housing)) + geom_bar()
nrow(BD2[BD2$housing == 'unknown', ])
ggplot(BD2) + geom_bar(aes(x = housing), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Month Variable
ggplot(data = BD2, aes(x = month)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = month), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Loan Variable
ggplot(data = BD2, aes(x = loan)) + geom_bar()
nrow(BD2[BD2$loan == 'unknown', ])
ggplot(BD2) + geom_bar(aes(x = loan), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Contact Variable
ggplot(data = BD2, aes(x = contact)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = contact), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
##day_of_week variable##
ggplot(data = BD2, aes(x = day_of_week)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = day_of_week), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#duration
ggplot(data = BD2, aes(x = duration)) + geom_bar()
#campaign
ggplot(data = BD2, aes(x = campaign)) + geom_bar()
#pdays
ggplot(data = BD2, aes(x = pdays)) + geom_bar()
# pdays = 999 means the client was not previously contacted; this covers almost all rows
#previous
ggplot(data = BD2, aes(x = previous)) + geom_bar()
##poutcome variable##
ggplot(data = BD2, aes(x = poutcome)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = poutcome), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#emp.var.rate
ggplot(data = BD2, aes(x = emp.var.rate)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = emp.var.rate), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#cons.price.idx
ggplot(data = BD2, aes(x = cons.price.idx)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = cons.price.idx), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#cons.conf.idx
ggplot(data = BD2, aes(x = cons.conf.idx)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = cons.conf.idx), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#euribor3m
ggplot(data = BD2, aes(x = euribor3m)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = euribor3m), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#nr.employed
ggplot(data = BD2, aes(x = nr.employed)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = nr.employed), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))