To be profitable, a business must reduce costs and target the right customers. This holds true for any organization, and with the growth of data and analytics, the ability to create targeted marketing campaigns is a potentially lucrative investment for any organization interested in making the most of the stores of data at its fingertips.
The purpose of this case study is to highlight the importance of strategic, data-driven predictive models and to predict whether a client will subscribe to a term deposit with a Portuguese bank. Data for this case study comes from the UCI Machine Learning Repository; it contains information from the Portuguese bank's telemarketing campaigns and can be obtained at http://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The dataset will be used to explore questions such as: What are the best predictors of whether or not a client subscribed to a term deposit? How accurate is the model at predicting our target variable? What are the main drivers for determining whether a client will or won't subscribe to a term deposit?
A logistic model resulted in the highest accuracy and could best predict whether or not a client would subscribe to a term deposit with the bank. The model predicts if a customer will purchase a term deposit product based on: the contact type, the month of contact, the call duration, the number of contacts made during the campaign, the number of contacts before the campaign, and the outcome of the previous marketing campaign.
As we delve into the data, determine the best model for this use case, and uncover the appropriate levers to pull, we will be better able to translate the results into data-driven insights and actions the business can take directly to guide its marketing campaign.
This will be accomplished through data investigation, data cleaning, model building, and interpretation. The insights uncovered will rely on the appropriate interpretation of the model that most accurately determines whether or not a client will subscribe to a term deposit. The Portuguese Bank Marketing dataset will be used to train and test various predictive models in R. To best serve the bank, the models built and examined will be evaluated for accuracy, consistency, and reliability.
The best model will be identified through interpretation of model statistics such as accuracy, AIC, p-values, and beta coefficients. To best guide and inform the Portuguese bank, the outputs will be closely examined to determine the best predictors of our target variable and to provide recommendations for a more targeted marketing campaign that yields the best results and drives the most profit. To provide actionable insights, the data pre-processing, cleaning, and eventual model interpretation will rely on best practices and established techniques for predictive model building.
In this case study, a logistic regression model and an LDA model will be trained to predict whether a customer will purchase a term deposit product. The predicted outcome variable is a binary yes/no indicating whether someone is a respondent. The dataset used for this analysis relates to the direct marketing campaigns of a Portuguese bank and is a randomly selected ten percent sample of the full dataset. It contains 4119 observations and 21 variables: a binary output variable y and 20 input variables, of which 10 are continuous and 10 are categorical. Each categorical variable with n categories requires n-1 dummy variables in order to be used in the model, as illustrated below. Significant variables will be found using backward selection; the selected variables will then be analyzed to confirm their significance in the model.
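As a minimal illustration of this dummy-variable encoding (R's glm() performs the expansion automatically when given factors; BD2 is the cleaned data frame used in the appendix):

# Illustration: a factor with n levels expands to n-1 dummy columns,
# with the first level as the reference category. The 3-level marital
# variable yields maritalmarried and maritalsingle; "divorced" is the baseline.
head(model.matrix(~ marital, data = BD2))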
Plots will be used to identify potentially significant input variables, possible interactions or multicollinearity between variables, and missing or outlier observations. Outliers and missing data points will be omitted or handled as specified in the Data section below. To train and test the model, the data will be divided into training and testing sets using an 80/20 split, resulting in a training set of 2907 observations and a test set of 726 observations.
As with other parametric methods, logistic models have several assumptions that must be satisfied to ensure an accurate model is built. These assumptions are:

- The outcome variable is binary.
- The observations are independent of one another.
- The continuous predictors are linearly related to the log-odds of the outcome.
- There is little or no multicollinearity among the predictors.
- The sample size is large relative to the number of predictors.
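One way to screen for multicollinearity is with variance inflation factors. The sketch below assumes the car package is available and uses the full.model fit shown in the appendix:

# Sketch: generalized variance inflation factors for the fitted model.
# GVIF^(1/(2*Df)) values well below ~2 suggest multicollinearity is not
# a serious concern among these predictors.
library(car)
vif(full.model)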
Additionally, logistic models may be over-fit on the training data, particularly when working with high-dimensional datasets with relatively few observations. This can result in a high predicted accuracy with the training data, but poor accuracy when applied to real-world data. Due to the large number of observations in the bank dataset, over-fitting is not likely to be a concern in this analysis, but will still be checked for.
The dataset consisted of 4119 observations and 21 total variables, including the response variable y. Initial analysis showed the data was fairly clean: while no variable had missing values, many of the categorical variables contained multiple "unknown" responses. These unknown values were examined on a case-by-case basis and removed when the number of observations in the category was comparatively low. Along with the unknown values, other opportunities to clean the variables were found prior to training: there was a single "illiterate" response in the education variable, and relatively few observations where age was below 25 or above 60. The data cleaning and pre-processing steps were:

- Remove the sparse "unknown" responses from job, marital, education, housing, and loan (the "unknown" level of default was retained, as it was comparatively common).
- Remove the rare "yes" responses in default.
- Remove the single "illiterate" response in education.
- Remove observations with age below 25 or above 60.
Finally, the categorical variables were re-factored using the cleaned dataset so that the dropped levels were removed. A total of 486 observations were excluded, leaving our "clean" dataset with 3633 observations.
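The cleaning code itself is not reproduced in the appendix; a hypothetical reconstruction of the steps described above, assuming the raw data are loaded as BankData, might look like the following:

# Hypothetical reconstruction of the cleaning steps (exact filters are
# assumptions). Sparse "unknown" levels, the rare default = "yes" rows,
# the single "illiterate" response, and the age tails are dropped; the
# common "unknown" level of default is kept.
BD2 <- subset(BankData,
              job != "unknown" & marital != "unknown" &
              education != "unknown" & education != "illiterate" &
              housing != "unknown" & loan != "unknown" &
              default != "yes" &
              age >= 25 & age <= 60)
BD2 <- droplevels(BD2)  # remove the now-empty factor levels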
None of the social and economic variables were included in the dataset that the model was built from. This was done to keep the focus of the model on the potential customer, and it was found that an accurate model could be trained without these additional attributes.
Our analysis resulted in a logistic model having the highest accuracy in predicting the target variable. The final model has a relatively high test accuracy of 92.98% and the following formula, where p is the predicted probability of subscribing (coefficients are on the log-odds scale): \(\log(p/(1-p)) = -3.71 - 1.72*contacttelephone - 0.18*monthaug + 1.39*monthdec - 0.76*monthjul + 1.52*monthjun + 2.52*monthmar - 0.20*monthmay - 0.75*monthnov + 1.82*monthoct + 1.38*monthsep + 0.0049*duration - 0.11*campaign + 0.37*previous + 0.16*poutcomenonexistent + 2.48*poutcomesuccess\)
This can be interpreted as follows: each coefficient is the change in the log-odds of subscribing associated with a one-unit increase in its predictor, holding the others fixed. For example, each additional second of call duration adds 0.0049 to the log-odds, being contacted by telephone rather than cellular lowers the log-odds by 1.72, and a successful outcome in a previous campaign raises the log-odds by 2.48.
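Because the coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are often easier to communicate. A minimal sketch using the final model (final) fit in the appendix:

# Each exponentiated coefficient multiplies the odds of subscribing.
# For example, exp(2.48) is roughly 11.9: a successful previous campaign
# outcome multiplies the odds of subscription by about 12.
round(exp(coef(final)), 3)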
The confusion matrix for this model on the test data shows a high accuracy rate; the model is particularly good at predicting non-responders, i.e., those who would not purchase a term deposit product.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 643 9
## 1 42 32
##
## Accuracy : 0.9298
## 95% CI : (0.9087, 0.9473)
## No Information Rate : 0.9435
## P-Value [Acc > NIR] : 0.9505
##
## Kappa : 0.5218
##
## Mcnemar's Test P-Value : 7.433e-06
##
## Sensitivity : 0.9387
## Specificity : 0.7805
## Pos Pred Value : 0.9862
## Neg Pred Value : 0.4324
## Prevalence : 0.9435
## Detection Rate : 0.8857
## Detection Prevalence : 0.8981
## Balanced Accuracy : 0.8596
##
## 'Positive' Class : 0
##
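For reference, the overall accuracy follows directly from the matrix: (643 + 32) / 726 ≈ 0.9298, matching the reported value.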
An LDA model was also fit, but the logistic model was found to have a higher test accuracy.
Based on the residual plots below, the residuals appear to be approximately normally distributed. Additionally, looking at Cook's distance, while some observations have a relatively large distance, none appear to be extreme outliers.
The model initially built using backward selection included all of the variables in the final model outlined above, plus the marital variable. However, further analysis of the p-values in the model output showed that marital was insignificant; removing it improved test accuracy from 92.42% to 92.98%.
Before determining that the preferred model would be built through logistic regression, an LDA model was trained and tested on the same pre-processed dataset with the same identified variables. The LDA model performed noticeably worse than the logistic model, with a test accuracy of 90.5%. Given that even a small improvement in accuracy could translate into a significant increase in profits, we chose the logistic model over the LDA.
In conclusion, we recommend using the logistic model described above to predict whether a client will respond to a direct marketing campaign and purchase a term deposit. As mentioned above, this model had the greatest predictive power of the models tested and is particularly good at predicting non-responders. This allows greater focus on the customers who are most likely to purchase a product, saving both time and money.
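In practice, the bank could score its contact list with the model and focus outreach on the clients with the highest predicted probabilities. A sketch, where prospects is a hypothetical data frame containing the same input columns as the training data:

# Rank prospective clients by predicted probability of subscribing so the
# campaign can prioritize likely responders. 'prospects' is hypothetical
# and must contain the final model's input columns.
prospects$PredProb <- predict(final, newdata = prospects, type = "response")
ranked <- prospects[order(-prospects$PredProb), ]
head(ranked)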
Despite the efforts made to maximize the predictive power of the chosen model, there are opportunities to improve upon our model, methodology, and data. One would be to better understand the specifics of the marketing campaign so that the analysis could be more tailored and focused. Additional analysis into the factors that lead a customer to decline a term deposit in one campaign but purchase one in the next would also be useful. The model shows that the number of times a customer was contacted in a previous campaign affects their decision in the current campaign; a better understanding of the factors behind that decision could yield further insights and profitable recommendations for the business.
Moro, S., Cortez, P., & Rita, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62, 22–31.
Babakus, E., Eroglu, S., & Yavas, U. (2004). Modeling consumers' choice behavior: An application in banking. Journal of Services Marketing, 18(6), 462–470. https://doi.org/10.1108/08876040410557249
# Packages used throughout the analysis
library(caret)    # createDataPartition, confusionMatrix
library(MASS)     # lda
library(ggplot2)  # exploratory plots

# Convert the categorical inputs to factors
factor_cols <- c("job", "marital", "education", "default", "housing",
                 "loan", "contact", "month", "day_of_week", "poutcome")
BD2[factor_cols] <- lapply(BD2[factor_cols], factor)
str(BD2)
## 'data.frame': 3633 obs. of 21 variables:
## $ age : int 30 39 25 47 32 32 41 31 35 25 ...
## $ job : Factor w/ 11 levels "admin.","blue-collar",..: 2 8 8 1 8 1 3 8 2 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 3 2 1 2 3 ...
## $ education : Factor w/ 6 levels "basic.4y","basic.6y",..: 3 4 4 6 6 6 6 5 3 2 ...
## $ default : Factor w/ 2 levels "no","unknown": 1 1 1 1 1 1 2 1 2 2 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 1 1 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 1 2 2 1 1 1 1 1 2 1 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 8 10 10 8 8 7 4 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 2 3 2 2 4 3 3 ...
## $ duration : int 487 346 227 58 128 290 44 68 170 301 ...
## $ campaign : int 2 4 1 1 3 4 2 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 2 0 0 1 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 1 2 2 1 2 2 ...
## $ emp.var.rate : num -1.8 1.1 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 1.4 ...
## $ cons.price.idx: num 92.9 94 94.5 93.2 94.2 ...
## $ cons.conf.idx : num -46.2 -36.4 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 -42.7 ...
## $ euribor3m : num 1.313 4.855 4.962 4.191 0.884 ...
## $ nr.employed : num 5099 5191 5228 5196 4964 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# Keep the client-level variables (columns 1-15) and the response y,
# dropping the social and economic attributes
BD2 = BD2[c(1:15, 21)]
str(BD2)
## 'data.frame': 3633 obs. of 16 variables:
## $ age : int 30 39 25 47 32 32 41 31 35 25 ...
## $ job : Factor w/ 11 levels "admin.","blue-collar",..: 2 8 8 1 8 1 3 8 2 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 3 2 1 2 3 ...
## $ education : Factor w/ 6 levels "basic.4y","basic.6y",..: 3 4 4 6 6 6 6 5 3 2 ...
## $ default : Factor w/ 2 levels "no","unknown": 1 1 1 1 1 1 2 1 2 2 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 1 1 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 1 2 2 1 1 1 1 1 2 1 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 8 10 10 8 8 7 4 ...
## $ day_of_week: Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 2 3 2 2 4 3 3 ...
## $ duration : int 487 346 227 58 128 290 44 68 170 301 ...
## $ campaign : int 2 4 1 1 3 4 2 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 2 0 0 1 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 1 2 2 1 2 2 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# Reproducible, stratified 80/20 train/test split
set.seed(1)
inTrain <- createDataPartition(BD2$y, p = 0.8, list = FALSE)
BankTrain1 = BD2[inTrain,]
BankTest1 = BD2[-inTrain,]
# Recode the response: 1 = subscribed ("yes"), 0 = did not subscribe
BankTrain1$y <- ifelse(BankTrain1$y == 'yes', 1, 0)
BankTest1$y <- ifelse(BankTest1$y == 'yes', 1, 0)
# Full logistic model using all remaining predictors
full.model = glm(formula = y ~ ., data = BankTrain1, family = binomial)
summary(full.model)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = BankTrain1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9836 -0.3203 -0.2161 -0.1245 2.8379
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2688353 1.2259843 -1.851 0.064224 .
## age -0.0012974 0.0106139 -0.122 0.902713
## jobblue-collar -0.2149344 0.3238015 -0.664 0.506828
## jobentrepreneur -0.8631008 0.5910161 -1.460 0.144189
## jobhousemaid 0.2238285 0.5414284 0.413 0.679311
## jobmanagement -0.3002416 0.3322662 -0.904 0.366198
## jobretired -0.0008408 0.4829278 -0.002 0.998611
## jobself-employed -0.4818882 0.4575717 -1.053 0.292276
## jobservices 0.2141980 0.3278194 0.653 0.513497
## jobstudent 1.3787048 0.5677547 2.428 0.015168 *
## jobtechnician 0.2451205 0.2569624 0.954 0.340126
## jobunemployed 0.3751666 0.4450409 0.843 0.399232
## maritalmarried -0.1729109 0.2676393 -0.646 0.518241
## maritalsingle -0.0227413 0.3031660 -0.075 0.940205
## educationbasic.6y -0.4467838 0.5077688 -0.880 0.378916
## educationbasic.9y -0.2452010 0.3763144 -0.652 0.514669
## educationhigh.school -0.2523120 0.3703567 -0.681 0.495702
## educationprofessional.course -0.2177957 0.4071596 -0.535 0.592709
## educationuniversity.degree 0.1740951 0.3751156 0.464 0.642568
## defaultunknown -0.0014664 0.2453144 -0.006 0.995231
## housingyes -0.1098561 0.1611457 -0.682 0.495416
## loanyes -0.1526658 0.2176320 -0.701 0.483000
## contacttelephone -1.6557381 0.2528481 -6.548 5.82e-11 ***
## monthaug -0.3100136 0.3701168 -0.838 0.402250
## monthdec 1.2927506 0.7904988 1.635 0.101973
## monthjul -0.8140762 0.3836824 -2.122 0.033859 *
## monthjun 1.4072900 0.3968341 3.546 0.000391 ***
## monthmar 2.2971836 0.5441679 4.221 2.43e-05 ***
## monthmay -0.1598098 0.3517958 -0.454 0.649635
## monthnov -0.7872673 0.3969035 -1.984 0.047309 *
## monthoct 1.5265833 0.5480786 2.785 0.005347 **
## monthsep 1.1295274 0.5613008 2.012 0.044184 *
## day_of_weekmon -0.0246007 0.2453536 -0.100 0.920133
## day_of_weekthu -0.1613287 0.2453545 -0.658 0.510838
## day_of_weektue -0.1615139 0.2524773 -0.640 0.522357
## day_of_weekwed -0.2006646 0.2623784 -0.765 0.444396
## duration 0.0049794 0.0002893 17.210 < 2e-16 ***
## campaign -0.1207924 0.0497010 -2.430 0.015083 *
## pdays -0.0009569 0.0009024 -1.060 0.288969
## previous 0.2765877 0.2186057 1.265 0.205787
## poutcomenonexistent 0.1167579 0.3660234 0.319 0.749734
## poutcomesuccess 1.6406991 0.8806517 1.863 0.062455 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1913.2 on 2906 degrees of freedom
## Residual deviance: 1168.3 on 2865 degrees of freedom
## AIC: 1252.3
##
## Number of Fisher Scoring iterations: 6
formula(full.model)
## y ~ age + job + marital + education + default + housing + loan +
## contact + month + day_of_week + duration + campaign + pdays +
## previous + poutcome
coef(full.model)
## (Intercept) age
## -2.2688353294 -0.0012973835
## jobblue-collar jobentrepreneur
## -0.2149343636 -0.8631007616
## jobhousemaid jobmanagement
## 0.2238284810 -0.3002416362
## jobretired jobself-employed
## -0.0008407974 -0.4818882313
## jobservices jobstudent
## 0.2141979987 1.3787048425
## jobtechnician jobunemployed
## 0.2451204821 0.3751666080
## maritalmarried maritalsingle
## -0.1729109304 -0.0227413151
## educationbasic.6y educationbasic.9y
## -0.4467838136 -0.2452009743
## educationhigh.school educationprofessional.course
## -0.2523120303 -0.2177957128
## educationuniversity.degree defaultunknown
## 0.1740951269 -0.0014663677
## housingyes loanyes
## -0.1098561292 -0.1526658257
## contacttelephone monthaug
## -1.6557381465 -0.3100135548
## monthdec monthjul
## 1.2927505652 -0.8140761735
## monthjun monthmar
## 1.4072899992 2.2971836372
## monthmay monthnov
## -0.1598098032 -0.7872672748
## monthoct monthsep
## 1.5265832986 1.1295273823
## day_of_weekmon day_of_weekthu
## -0.0246006730 -0.1613287415
## day_of_weektue day_of_weekwed
## -0.1615139186 -0.2006645811
## duration campaign
## 0.0049794469 -0.1207923723
## pdays previous
## -0.0009569390 0.2765876676
## poutcomenonexistent poutcomesuccess
## 0.1167578724 1.6406990552
# Training-set predictions: classify as a responder when the predicted
# probability is at least 0.5
BankTrain1$PredProb = predict.glm(full.model, newdata = BankTrain1, type = 'response')
BankTrain1$PredChoice = ifelse(BankTrain1$PredProb >= .5, 1, 0)
caret::confusionMatrix(as.factor(BankTrain1$y), as.factor(BankTrain1$PredChoice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2558 53
## 1 187 109
##
## Accuracy : 0.9174
## 95% CI : (0.9068, 0.9272)
## No Information Rate : 0.9443
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4353
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9319
## Specificity : 0.6728
## Pos Pred Value : 0.9797
## Neg Pred Value : 0.3682
## Prevalence : 0.9443
## Detection Rate : 0.8799
## Detection Prevalence : 0.8982
## Balanced Accuracy : 0.8024
##
## 'Positive' Class : 0
##
full.model.train.acc = .9174
# Test-set predictions for the full model
BankTest1$PredProb = predict.glm(full.model, newdata = BankTest1, type = 'response')
BankTest1$PredChoice = ifelse(BankTest1$PredProb >= .5, 1, 0)
caret::confusionMatrix(as.factor(BankTest1$y), as.factor(BankTest1$PredChoice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 641 11
## 1 44 30
##
## Accuracy : 0.9242
## 95% CI : (0.9025, 0.9424)
## No Information Rate : 0.9435
## P-Value [Acc > NIR] : 0.9875
##
## Kappa : 0.4843
##
## Mcnemar's Test P-Value : 1.597e-05
##
## Sensitivity : 0.9358
## Specificity : 0.7317
## Pos Pred Value : 0.9831
## Neg Pred Value : 0.4054
## Prevalence : 0.9435
## Detection Rate : 0.8829
## Detection Prevalence : 0.8981
## Balanced Accuracy : 0.8337
##
## 'Positive' Class : 0
##
full.model.test.acc = .9242
# Backward selection from the full model by stepwise AIC
set.seed(1)
back.model = step(full.model, trace = 0)
summary(back.model)
##
## Call:
## glm(formula = y ~ marital + contact + month + duration + campaign +
## previous + poutcome, family = binomial, data = BankTrain1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9756 -0.3232 -0.2250 -0.1317 2.9492
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6160581 0.4898341 -7.382 1.56e-13 ***
## maritalmarried -0.2204167 0.2627254 -0.839 0.401490
## maritalsingle 0.1506302 0.2753804 0.547 0.584386
## contacttelephone -1.6660945 0.2471946 -6.740 1.58e-11 ***
## monthaug -0.2351446 0.3573054 -0.658 0.510470
## monthdec 1.3693569 0.7650077 1.790 0.073455 .
## monthjul -0.8117587 0.3708845 -2.189 0.028618 *
## monthjun 1.4405179 0.3840940 3.750 0.000177 ***
## monthmar 2.3857575 0.5200904 4.587 4.49e-06 ***
## monthmay -0.2308838 0.3375313 -0.684 0.493952
## monthnov -0.7730728 0.3845480 -2.010 0.044395 *
## monthoct 1.6974745 0.5312758 3.195 0.001398 **
## monthsep 1.2910765 0.5488018 2.353 0.018646 *
## duration 0.0048860 0.0002816 17.351 < 2e-16 ***
## campaign -0.1089008 0.0482991 -2.255 0.024151 *
## previous 0.3676629 0.2015894 1.824 0.068179 .
## poutcomenonexistent 0.1776117 0.3582398 0.496 0.620043
## poutcomesuccess 2.5112497 0.3427697 7.326 2.37e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1913.2 on 2906 degrees of freedom
## Residual deviance: 1191.1 on 2889 degrees of freedom
## AIC: 1227.1
##
## Number of Fisher Scoring iterations: 6
formula(back.model)
## y ~ marital + contact + month + duration + campaign + previous +
## poutcome
coef(back.model)
## (Intercept) maritalmarried maritalsingle
## -3.616058085 -0.220416692 0.150630250
## contacttelephone monthaug monthdec
## -1.666094465 -0.235144556 1.369356855
## monthjul monthjun monthmar
## -0.811758739 1.440517945 2.385757489
## monthmay monthnov monthoct
## -0.230883834 -0.773072779 1.697474470
## monthsep duration campaign
## 1.291076476 0.004886002 -0.108900812
## previous poutcomenonexistent poutcomesuccess
## 0.367662878 0.177611705 2.511249711
# Fresh copy of the test set for evaluating the backward-selected model
BankTest2 = BD2[-inTrain,]
BankTest2$y <- ifelse(BankTest2$y == 'yes', 1, 0)
back.model.test.acc = .9242  # test accuracy, from the confusion matrix below
BankTest2$PredProb = predict.glm(back.model, newdata = BankTest2, type = 'response')
BankTest2$PredChoice = ifelse(BankTest2$PredProb >= .5, 1, 0)
caret::confusionMatrix(as.factor(BankTest2$y), as.factor(BankTest2$PredChoice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 642 10
## 1 45 29
##
## Accuracy : 0.9242
## 95% CI : (0.9025, 0.9424)
## No Information Rate : 0.9463
## P-Value [Acc > NIR] : 0.9951
##
## Kappa : 0.4764
##
## Mcnemar's Test P-Value : 4.549e-06
##
## Sensitivity : 0.9345
## Specificity : 0.7436
## Pos Pred Value : 0.9847
## Neg Pred Value : 0.3919
## Prevalence : 0.9463
## Detection Rate : 0.8843
## Detection Prevalence : 0.8981
## Balanced Accuracy : 0.8390
##
## 'Positive' Class : 0
##
# Final model: the backward-selected model with the insignificant
# marital variable removed
BankTest3 = BD2[-inTrain,]
BankTest3$y <- ifelse(BankTest3$y == 'yes', 1, 0)
final = glm(formula = y ~ contact + month + duration +
              campaign + previous + poutcome,
            data = BankTrain1, family = binomial)
summary(final)
##
## Call:
## glm(formula = y ~ contact + month + duration + campaign + previous +
## poutcome, family = binomial, data = BankTrain1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9575 -0.3211 -0.2291 -0.1301 2.8946
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.7144288 0.4393945 -8.454 < 2e-16 ***
## contacttelephone -1.7156467 0.2464950 -6.960 3.40e-12 ***
## monthaug -0.1841470 0.3570103 -0.516 0.60599
## monthdec 1.3889177 0.7669722 1.811 0.07015 .
## monthjul -0.7576417 0.3701062 -2.047 0.04065 *
## monthjun 1.5213752 0.3838998 3.963 7.40e-05 ***
## monthmar 2.5167602 0.5157246 4.880 1.06e-06 ***
## monthmay -0.1969572 0.3376368 -0.583 0.55966
## monthnov -0.7495801 0.3847047 -1.948 0.05136 .
## monthoct 1.8201343 0.5252828 3.465 0.00053 ***
## monthsep 1.3840190 0.5448186 2.540 0.01107 *
## duration 0.0048717 0.0002812 17.324 < 2e-16 ***
## campaign -0.1075191 0.0480860 -2.236 0.02535 *
## previous 0.3743971 0.1994030 1.878 0.06044 .
## poutcomenonexistent 0.1584876 0.3565704 0.444 0.65670
## poutcomesuccess 2.4771524 0.3416229 7.251 4.13e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1913.2 on 2906 degrees of freedom
## Residual deviance: 1195.7 on 2891 degrees of freedom
## AIC: 1227.7
##
## Number of Fisher Scoring iterations: 6
# Test-set predictions for the final model
BankTest3$PredProb = predict.glm(final, newdata = BankTest3, type = 'response')
BankTest3$PredChoice = ifelse(BankTest3$PredProb >= .5, 1, 0)
caret::confusionMatrix(as.factor(BankTest3$y), as.factor(BankTest3$PredChoice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 643 9
## 1 42 32
##
## Accuracy : 0.9298
## 95% CI : (0.9087, 0.9473)
## No Information Rate : 0.9435
## P-Value [Acc > NIR] : 0.9505
##
## Kappa : 0.5218
##
## Mcnemar's Test P-Value : 7.433e-06
##
## Sensitivity : 0.9387
## Specificity : 0.7805
## Pos Pred Value : 0.9862
## Neg Pred Value : 0.4324
## Prevalence : 0.9435
## Detection Rate : 0.8857
## Detection Prevalence : 0.8981
## Balanced Accuracy : 0.8596
##
## 'Positive' Class : 0
##
# LDA model (MASS::lda) using the demographic and campaign predictors
model.lda = lda(formula = as.factor(y) ~ age + job + marital +
                  education + default + loan + contact + month +
                  day_of_week + poutcome, data = BankTrain1)
# View the output
model.lda
## Call:
## lda(as.factor(y) ~ age + job + marital + education + default +
## loan + contact + month + day_of_week + poutcome, data = BankTrain1)
##
## Prior probabilities of groups:
## 0 1
## 0.8981768 0.1018232
##
## Group means:
## age jobblue-collar jobentrepreneur jobhousemaid jobmanagement
## 0 39.77748 0.2202221 0.03944849 0.02527767 0.07851398
## 1 39.37838 0.1385135 0.02027027 0.01689189 0.06756757
## jobretired jobself-employed jobservices jobstudent jobtechnician
## 0 0.02719265 0.03791651 0.09957871 0.00842589 0.1734967
## 1 0.03716216 0.03378378 0.07770270 0.03378378 0.1959459
## jobunemployed maritalmarried maritalsingle educationbasic.6y
## 0 0.02642666 0.6261969 0.2619686 0.06089621
## 1 0.04054054 0.5270270 0.3614865 0.03040541
## educationbasic.9y educationhigh.school educationprofessional.course
## 0 0.1443891 0.2366909 0.1355802
## 1 0.1081081 0.2195946 0.1385135
## educationuniversity.degree defaultunknown loanyes contacttelephone
## 0 0.3182689 0.2049023 0.1711988 0.3872080
## 1 0.4324324 0.1081081 0.1554054 0.1756757
## monthaug monthdec monthjul monthjun monthmar monthmay monthnov
## 0 0.1635389 0.002680965 0.1719648 0.1260054 0.00536193 0.3554194 0.1122175
## 1 0.1452703 0.020270270 0.1317568 0.1621622 0.05743243 0.2195946 0.1081081
## monthoct monthsep day_of_weekmon day_of_weekthu day_of_weektue
## 0 0.00842589 0.00842589 0.2022214 0.2098813 0.2010724
## 1 0.03716216 0.04729730 0.2094595 0.2128378 0.1993243
## day_of_weekwed poutcomenonexistent poutcomesuccess
## 0 0.2014554 0.8885484 0.01302183
## 1 0.1587838 0.6722973 0.19594595
##
## Coefficients of linear discriminants:
## LD1
## age 0.004743784
## jobblue-collar -0.185631813
## jobentrepreneur -0.273007137
## jobhousemaid -0.106720287
## jobmanagement -0.361416838
## jobretired 0.128865224
## jobself-employed -0.213913023
## jobservices -0.171665896
## jobstudent 1.377439975
## jobtechnician 0.061678678
## jobunemployed -0.036977638
## maritalmarried -0.091722715
## maritalsingle 0.029393543
## educationbasic.6y -0.143832768
## educationbasic.9y -0.041575973
## educationhigh.school -0.079895270
## educationprofessional.course -0.089699435
## educationuniversity.degree 0.142432916
## defaultunknown -0.077533958
## loanyes -0.007371146
## contacttelephone -0.867130702
## monthaug -0.719005010
## monthdec 1.543193662
## monthjul -0.573400104
## monthjun 0.485066255
## monthmar 2.874203490
## monthmay -0.254035976
## monthnov -0.606019413
## monthoct 1.301459115
## monthsep 1.192507712
## day_of_weekmon -0.120561510
## day_of_weekthu -0.140554231
## day_of_weektue -0.098890320
## day_of_weekwed -0.183019461
## poutcomenonexistent 0.002850452
## poutcomesuccess 4.314598659
# Predictions
predictions.lda = predict(model.lda, BankTest3)
# Make confusion matrix for the LDA predictions to compare accuracy
caret::confusionMatrix(as.factor(predictions.lda$class), as.factor(BankTest3$y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 641 58
## 1 11 16
##
## Accuracy : 0.905
## 95% CI : (0.8813, 0.9253)
## No Information Rate : 0.8981
## P-Value [Acc > NIR] : 0.2944
##
## Kappa : 0.2775
##
## Mcnemar's Test P-Value : 3.064e-08
##
## Sensitivity : 0.9831
## Specificity : 0.2162
## Pos Pred Value : 0.9170
## Neg Pred Value : 0.5926
## Prevalence : 0.8981
## Detection Rate : 0.8829
## Detection Prevalence : 0.9628
## Balanced Accuracy : 0.5997
##
## 'Positive' Class : 0
##
#Accuracy = .905
# Diagnostics for the glm trained on the cleaned data
names(final$coefficients)
## [1] "(Intercept)" "contacttelephone" "monthaug"
## [4] "monthdec" "monthjul" "monthjun"
## [7] "monthmar" "monthmay" "monthnov"
## [10] "monthoct" "monthsep" "duration"
## [13] "campaign" "previous" "poutcomenonexistent"
## [16] "poutcomesuccess"
par(mfrow=c(3,2))
#Plot of residuals, want to make sure they are normal
plot(density(resid(final, type='response')),
main = "Residuals")
#Check independence of observations
scatter.smooth(rstandard(final, type='deviance'),
col='gray', main = "Observations")
#Plot of residuals fitted against fitted values of y
scatter.smooth(predict(final, type='response'),
rstandard(final, type='deviance'), col='gray', main="Residuals vs Fitted Values of Y")
#Cook's D
plot(cooks.distance(final), type='h', main = "Cook's D")
#Checking out deviance residuals
plot(density(resid(final, type='deviance')),
main = "Deviance of Residuals")
# Explore the tails of the age distribution in the raw data
nrow(BankData[BankData$age < 10, ])
nrow(BankData[BankData$age < 20, ])
nrow(BankData[BankData$age < 25, ])
nrow(BankData[BankData$age < 30, ])
nrow(BankData[BankData$age > 50, ])
nrow(BankData[BankData$age > 55, ])
nrow(BankData[BankData$age > 60, ])
nrow(BankData[BankData$age > 65, ])
nrow(BankData[BankData$age > 70, ])
ggplot(data = BankData, aes(x = age)) +
geom_histogram(binwidth = 5, col = "white")
#There appears to be very few observations where the customer's age is less than 25 or greater than 60.
# Re-check the age tails in the cleaned data
dim(BD2)
nrow(BD2[BD2$age < 25, ])
nrow(BD2[BD2$age > 60, ])
ggplot(data = BD2, aes(x = age)) + geom_histogram(binwidth = 5, col = "white")
ggplot(BD2) + geom_histogram(aes(x = age), binwidth = 0.1, col = "white") +
facet_grid(y~., scales = "free") + scale_x_log10()
#Job Var
ggplot(data = BD2, aes(x = job)) + geom_bar()
nrow(BD2[BD2$job == 'unknown', ])
ggplot(BD2) + geom_bar(aes(x = job), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Marital Variable
ggplot(data = BD2, aes(x = marital)) + geom_bar()
nrow(BD2[BD2$marital == 'unknown', ])
ggplot(BD2) + geom_bar(aes(x = marital), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Education Variable
ggplot(data = BD2, aes(x = education)) + geom_bar()
nrow(BD2[BD2$education == 'unknown', ])
nrow(BD2[BD2$education == 'illiterate', ])
dim(BD2)
ggplot(BD2) + geom_bar(aes(x = education), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Default Variable
ggplot(data = BD2, aes(x = default)) + geom_bar()
nrow(BD2[BD2$default == 'unknown', ])
nrow(BD2[BD2$default == 'yes', ])
ggplot(BD2) + geom_bar(aes(x = default), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Housing Variable
ggplot(data = BD2, aes(x = housing)) + geom_bar()
nrow(BD2[BD2$housing == 'unknown', ])
ggplot(BD2) + geom_bar(aes(x = housing), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Month Variable
ggplot(data = BD2, aes(x = month)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = month), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Loan Variable
ggplot(data = BD2, aes(x = loan)) + geom_bar()
nrow(BD2[BD2$loan == 'unknown', ])
ggplot(BD2) + geom_bar(aes(x = loan), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Contact Variable
ggplot(data = BD2, aes(x = contact)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = contact), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
##day_of_week variable##
ggplot(data = BD2, aes(x = day_of_week)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = day_of_week), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#duration
ggplot(data = BD2, aes(x = duration)) + geom_bar()
#campaign
ggplot(data = BD2, aes(x = campaign)) + geom_bar()
#pdays
ggplot(data = BD2, aes(x = pdays)) + geom_bar()
# pdays = 999 means the client was not previously contacted; this covers almost all rows
#previous
ggplot(data = BD2, aes(x = previous)) + geom_bar()
##poutcome variable##
ggplot(data = BD2, aes(x = poutcome)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = poutcome), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#emp.var.rate
ggplot(data = BD2, aes(x = emp.var.rate)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = emp.var.rate), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#cons.price.idx
ggplot(data = BD2, aes(x = cons.price.idx)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = cons.price.idx), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#cons.conf.idx
ggplot(data = BD2, aes(x = cons.conf.idx)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = cons.conf.idx), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#euribor3m
ggplot(data = BD2, aes(x = euribor3m)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = euribor3m), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#nr.employed
ggplot(data = BD2, aes(x = nr.employed)) + geom_bar()
ggplot(BD2) + geom_bar(aes(x = nr.employed), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))