The goal for this data set is to predict which customers will subscribe to a term deposit based on customer characteristics and the methods used by the marketing campaign. The data set contains 45,211 observations and 17 variables, with no missing or null values found in any variable. The variables and their descriptions are listed below:
1 - age: age of customer (numeric)
2 - job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”)
3 - marital : marital status (categorical: “married”, “divorced”, “single”; note: “divorced” means divorced or widowed)
4 - education (categorical: “unknown”, “secondary”, “primary”, “tertiary”)
5 - default: has credit in default? (binary: “yes”, “no”)
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: “yes”, “no”)
8 - loan: has personal loan? (binary: “yes”, “no”)
Related to the last contact of the current campaign:
9 - contact: contact communication type (categorical: “unknown”, “telephone”, “cellular”)
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
12 - duration: last contact duration, in seconds (numeric)
Other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”, “other”, “failure”, “success”)
Output variable:
17 - y: has the client subscribed to a term deposit? (binary: “yes”, “no”)
Boxplots for EDA
Histograms for EDA
Histograms for Transformed Features
The campaign variable was more difficult because none of the transformations tried (log, square root, and cube root) normalized its distribution, so the variable was instead grouped into categories to remove the sparse values. For both pdays and previous, more than 80% of customers had not been previously contacted, so each was collapsed into two groups, not contacted versus contacted, again to remove the sparse levels. A code sketch of this grouping follows the tables below.
| previous: Not Previously Contacted (0) | previous: Previously Contacted (1+) |
|---|---|
| 36954 | 8257 |
| pdays: Not Previously Contacted (-1) | pdays: Contacted One or More Days Ago |
|---|---|
| 36954 | 8257 |
| Contacted 1 Time | Contacted 2 Times | Contacted 3 Times | Contacted 4 or More Times |
|---|---|---|---|
| 17544 | 12505 | 5521 | 9641 |
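Below is a minimal sketch of how these groupings could be created in R, assuming the data are loaded into a data frame named `bank` with the original column names; the new variable names (`grp_pre`, `grp_pd`, `grp_cmpn`) mirror labels used later in the report but are otherwise assumptions.

```r
# Collapse sparse counts into coarse groups (sketch; variable names are assumptions)
bank$grp_pre  <- factor(ifelse(bank$previous == 0, "0", "1+"))                   # prior contacts: none vs. some
bank$grp_pd   <- factor(ifelse(bank$pdays == -1, "not contacted", "contacted"))  # -1 codes "never contacted"
bank$grp_cmpn <- cut(bank$campaign,
                     breaks = c(0, 1, 2, 3, Inf),
                     labels = c("1", "2", "3", "4+"))                            # contacts in this campaign

table(bank$grp_pre)    # should reproduce the 36954 / 8257 split shown above
table(bank$grp_cmpn)
```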
Next, a pairwise comparison was performed with a pairwise scatterplot for each numeric variable, excluding the newly grouped variables campaign, previous, and pdays. The scatterplots all showed a similar pattern, with the red curve departing only slightly from the blue curve, indicating weak correlation between the compared numeric variables. Because every pairwise correlation is low, none of the variables is redundant, and all of them can be retained for the subsequent models and algorithms.
Pairwise Plots
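The pairwise plot could be produced along these lines; this is a sketch assuming the `GGally` package and the transformed column names used later in the report.

```r
library(GGally)
library(ggplot2)

# Pairwise scatterplots of the numeric predictors, coloured by the response
ggpairs(bank[, c("trans_age", "day", "dur_min", "new_bal", "y")],
        mapping = aes(colour = y, alpha = 0.3))
```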
None of the categorical variables contained missing or null values, so next, variables with many categories were regrouped to reduce the number of levels. Both job and month had more than 10 categories, so month was grouped by season rather than by month, and job was collapsed into broader category titles. One new category, Unknown/Unemployed, combined students, retired, unemployed, and unknown. Another, Blue-collar/Services, combined blue-collar, services, housemaid, and technician. The last category, Business, combined admin., entrepreneur, management, and self-employed. A sketch of this regrouping follows the tables below.
| Fall | Spring | Summer | Winter |
|---|---|---|---|
| 5287 | 17175 | 18483 | 4266 |
| Blue Collar / Service | Business | Unknown / Unemployed |
|---|---|---|
| 22723 | 17695 | 4793 |
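A sketch of the regrouping, assuming the engineered data frame `bank`; the season assignment (March-May as spring, and so on) is inferred from the counts in the table above.

```r
# Map months to seasons (assumed mapping, consistent with the season counts above)
season_map <- c(mar = "Spring", apr = "Spring", may = "Spring",
                jun = "Summer", jul = "Summer", aug = "Summer",
                sep = "Fall",   oct = "Fall",   nov = "Fall",
                dec = "Winter", jan = "Winter", feb = "Winter")
bank$grp_mon <- factor(season_map[as.character(bank$month)])

# Collapse the twelve job titles into three broader groups
job_map <- c("admin." = "Business",                  entrepreneur = "Business",
             management = "Business",                "self-employed" = "Business",
             "blue-collar" = "Blue Collar / Service", services = "Blue Collar / Service",
             housemaid = "Blue Collar / Service",     technician = "Blue Collar / Service",
             student = "Unknown / Unemployed",        retired = "Unknown / Unemployed",
             unemployed = "Unknown / Unemployed",     unknown = "Unknown / Unemployed")
bank$grp_job <- factor(job_map[as.character(bank$job)])

table(bank$grp_mon)
table(bank$grp_job)
```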
Pairwise Comparison
When building a model for the newly engineered data set, we chose a logistic regression approach because the response variable is binary. We started with the full model, which includes every available predictor regardless of whether it improves the fit. We then moved to the reduced model, which keeps only the significant predictors: any variable whose coefficient p-value exceeded 0.05 was dropped. These two models were then used as the lower and upper bounds when searching for the final model. A sketch of fitting both models follows.
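A minimal sketch of the two fits, assuming the engineered data frame `bank` and the predictor names shown in the coefficient tables below (e.g. `grp_job`, `trans_age`, `dur_min`, `new_bal`).

```r
# Full model: every engineered predictor
full_model <- glm(y ~ mar + grp_job + edu + def + hous + loan + cont + grp_mon +
                    pout + grp_pre + trans_age + day + dur_min + new_bal + grp_cmpn,
                  data = bank, family = binomial)

# Reduced model: only the terms with p-values below 0.05 in the full fit
reduced_model <- glm(y ~ mar + grp_job + edu + hous + loan + pout +
                       dur_min + new_bal,
                     data = bank, family = binomial)

summary(full_model)
summary(reduced_model)
```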
Full Model Coefficients

| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -6.0250432 | 1.0850496 | -5.5527816 | 0.0000000 |
| mar married | -0.1985343 | 0.0598173 | -3.3190110 | 0.0009034 |
| mar single | 0.0643846 | 0.0687361 | 0.9366924 | 0.3489168 |
| grp_jobBus. | 0.1519907 | 0.0449950 | 3.3779429 | 0.0007303 |
| grp_jobUnk./Unemp. | 0.5323866 | 0.0570103 | 9.3384219 | 0.0000000 |
| edu secondary | 0.2250788 | 0.0615388 | 3.6575087 | 0.0002547 |
| edu tertiary | 0.4154247 | 0.0687663 | 6.0411049 | 0.0000000 |
| edu unknown | 0.2909910 | 0.1006219 | 2.8919258 | 0.0038289 |
| def yes | -0.2418033 | 0.2354876 | -1.0268193 | 0.3045055 |
| hous. yes | -0.9128193 | 0.0430473 | -21.2050323 | 0.0000000 |
| loan yes | -0.5008711 | 0.0618657 | -8.0961013 | 0.0000000 |
| cont telephone | -0.0136247 | 0.0743607 | -0.1832243 | 0.8546220 |
| cont unknown | -1.1597910 | 0.0592143 | -19.5863366 | 0.0000000 |
| grp_monSpri | 0.0800941 | 0.0581875 | 1.3764838 | 0.1686719 |
| grp_monSumm | -0.2744039 | 0.0569068 | -4.8219879 | 0.0000014 |
| grp_monWint | -0.2552634 | 0.0691594 | -3.6909409 | 0.0002234 |
| pout other | 0.2874105 | 0.0903786 | 3.1800730 | 0.0014724 |
| pout success | 2.3486974 | 0.0810564 | 28.9760971 | 0.0000000 |
| pout unknown | 1.2522823 | 1.0254055 | 1.2212557 | 0.2219892 |
| grp_pre1+ | 1.4891726 | 1.0244034 | 1.4536973 | 0.1460302 |
| trans_age | -0.1465316 | 0.0838891 | -1.7467291 | 0.0806843 |
| day | -0.0057074 | 0.0022398 | -2.5481238 | 0.0108304 |
| dur_min | 2.1338470 | 0.0323923 | 65.8752105 | 0.0000000 |
| new_bal | 0.0225097 | 0.0033625 | 6.6942925 | 0.0000000 |
| grp_cmpn2 | -0.3786628 | 0.0443444 | -8.5391391 | 0.0000000 |
| grp_cmpn3 | -0.3080132 | 0.0598627 | -5.1453307 | 0.0000003 |
| grp_cmpn4-63 | -0.5046849 | 0.0567881 | -8.8871523 | 0.0000000 |
Reduced Model Coefficients

| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -5.4278622 | 0.1130642 | -48.006885 | 0.0000000 |
| mar married | -0.1824294 | 0.0585650 | -3.114989 | 0.0018395 |
| mar single | 0.1764423 | 0.0616811 | 2.860555 | 0.0042290 |
| grp_jobBus. | 0.1805590 | 0.0442184 | 4.083343 | 0.0000444 |
| grp_jobUnk./Unemp. | 0.5943619 | 0.0547011 | 10.865630 | 0.0000000 |
| edu secondary | 0.2800343 | 0.0598823 | 4.676414 | 0.0000029 |
| edu tertiary | 0.4876643 | 0.0668516 | 7.294724 | 0.0000000 |
| edu unknown | 0.2744635 | 0.0989194 | 2.774617 | 0.0055267 |
| hous. yes | -0.9228536 | 0.0384500 | -24.001414 | 0.0000000 |
| loan yes | -0.5037262 | 0.0608733 | -8.274994 | 0.0000000 |
| pout other | 0.2260754 | 0.0898139 | 2.517154 | 0.0118307 |
| pout success | 2.2696256 | 0.0802112 | 28.295626 | 0.0000000 |
| pout unknown | -0.6519499 | 0.0549714 | -11.859808 | 0.0000000 |
| dur_min | 2.0871488 | 0.0314338 | 66.398144 | 0.0000000 |
| new_bal | 0.0242147 | 0.0032708 | 7.403219 | 0.0000000 |
Using the full and reduced models as bounds, the final model ended up using 13 of the 16 available predictors. It did not use the default variable (whether the customer has credit in default), the previous variable (how many times the customer was contacted before this campaign), or pdays (how many days have passed since the customer was last contacted in a previous campaign). In other words, these variables did not significantly improve the model's ability to predict whether a customer subscribed to the term deposit.
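One way to carry out this bounded search is stepwise selection between the two formulas; the snippet below is a sketch of that idea, not necessarily the exact procedure used.

```r
# Stepwise search bounded below by the reduced model and above by the full model
final_model <- step(reduced_model,
                    scope = list(lower = formula(reduced_model),
                                 upper = formula(full_model)),
                    direction = "both", trace = FALSE)
summary(final_model)
```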
Final Model Coefficients

| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -4.5392208 | 0.3516724 | -12.9075249 | 0.0000000 |
| mar married | -0.1962908 | 0.0597952 | -3.2827176 | 0.0010281 |
| mar single | 0.0660461 | 0.0687212 | 0.9610730 | 0.3365155 |
| grp_jobBus. | 0.1520453 | 0.0449911 | 3.3794556 | 0.0007263 |
| grp_jobUnk./Unemp. | 0.5341695 | 0.0569901 | 9.3730185 | 0.0000000 |
| edu secondary | 0.2256504 | 0.0615350 | 3.6670244 | 0.0002454 |
| edu tertiary | 0.4165539 | 0.0687592 | 6.0581558 | 0.0000000 |
| edu unknown | 0.2916336 | 0.1006156 | 2.8984939 | 0.0037496 |
| hous. yes | -0.9117506 | 0.0430296 | -21.1889126 | 0.0000000 |
| loan yes | -0.5027728 | 0.0617813 | -8.1379507 | 0.0000000 |
| cont telephone | -0.0134104 | 0.0743609 | -0.1803422 | 0.8568839 |
| cont unknown | -1.1602671 | 0.0592103 | -19.5956967 | 0.0000000 |
| grp_monSpri | 0.0797174 | 0.0581543 | 1.3707895 | 0.1704406 |
| grp_monSumm | -0.2754371 | 0.0568831 | -4.8421587 | 0.0000013 |
| grp_monWint | -0.2559840 | 0.0691383 | -3.7024937 | 0.0002135 |
| pout other | 0.2868420 | 0.0903877 | 3.1734613 | 0.0015063 |
| pout success | 2.3493013 | 0.0810550 | 28.9840362 | 0.0000000 |
| pout unknown | -0.2369468 | 0.0576687 | -4.1087592 | 0.0000398 |
| trans_age | -0.1471512 | 0.0838811 | -1.7542844 | 0.0793818 |
| day | -0.0057369 | 0.0022398 | -2.5613819 | 0.0104257 |
| dur_min | 2.1337476 | 0.0323876 | 65.8815797 | 0.0000000 |
| new_bal | 0.0227796 | 0.0033506 | 6.7985578 | 0.0000000 |
| grp_cmpn2 | -0.3786729 | 0.0443409 | -8.5400410 | 0.0000000 |
| grp_cmpn3 | -0.3072668 | 0.0598581 | -5.1332530 | 0.0000003 |
| grp_cmpn4-63 | -0.5041658 | 0.0567725 | -8.8804611 | 0.0000000 |
Variables with negative coefficients lower the predicted odds that a customer subscribes to the term deposit when that category applies. For example, the coefficient for the marital variable is negative for married customers, indicating that a married customer is less likely to subscribe. The same holds for customers with housing loans, with personal loans, who were contacted by telephone or by unknown means, who were last contacted in the winter or summer, who are older, who were contacted multiple times during this campaign, who were contacted later in the month, and whose previous campaign outcome was unknown.
Variables with positive coefficients raise the predicted odds of subscribing when that category applies. This includes customers who are single, whose jobs fall in the business or unknown/unemployed groups, who have secondary, tertiary, or unknown education, who were contacted in the spring, whose previous campaign outcome was success or other, who have a higher average yearly balance, and who spent longer in conversation when contacted. For example, customers contacted during the spring were more likely to subscribe than those contacted in the reference season, fall.
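Since logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which makes statements like the ones above concrete; a sketch assuming the fitted object `final_model`:

```r
# Odds ratios: multiplicative change in the odds of subscribing per unit or category
round(exp(coef(final_model)), 3)

# e.g. exp(-0.196) is about 0.82, so married customers have roughly 18% lower
# odds of subscribing than the reference group (divorced/widowed), all else equal.
```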
| trans_age | mar | grp_job | edu | dur_min | new_bal | loan | Pred.Response |
|---|---|---|---|---|---|---|---|
| 3 | married | Bus. | primary | 2 | 100 | no | 0 |
| 4 | single | Unk./Unemp. | tertiary | 4 | 40 | yes | 1 |
We can now use the model to predict the outcome for customers with specific characteristics. In the first case, the model predicted that a customer who was married, worked in business, was in their twenties, had a primary education, an average yearly balance value of 100 (likely in thousands of euros), no personal loan, and a contact conversation of roughly 5 to 10 minutes would not subscribe to the term deposit in this campaign. In the second case, the model predicted that a customer who was single, unemployed or with an unknown job, in their fifties, with a tertiary education, an average yearly balance value of 40, an existing personal loan, and a conversation of roughly 50 minutes would subscribe.
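A sketch of scoring such profiles with `predict()`, assuming `final_model`; the first seven columns mirror the table above, while the remaining predictors required by the model are filled in with hypothetical values for illustration.

```r
new_customers <- data.frame(
  trans_age = c(3, 4),                     # assumed log(age): roughly 20s and 50s
  mar       = c("married", "single"),
  grp_job   = c("Bus.", "Unk./Unemp."),
  edu       = c("primary", "tertiary"),
  dur_min   = c(2, 4),                     # assumed log of call duration in minutes
  new_bal   = c(100, 40),
  loan      = c("no", "yes"),
  # remaining model predictors, set to hypothetical values
  hous = "no", cont = "cellular", grp_mon = "Summer",
  pout = "unknown", day = 15, grp_cmpn = "1"
)

prob <- predict(final_model, newdata = new_customers, type = "response")
ifelse(prob > 0.5, 1, 0)   # or compare against the tuned cut-off probability
```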
Another approach to model building is data-driven: we let the data and observations guide the choice of model. Here we use cross-validation, a cut-off probability, and an ROC curve to help determine the best model.
After building several models, we want to see how they perform. To start, the data set is split into two random groups: a training set and a testing set. The training set holds 70% of the observations and the testing set the remaining 30%. The training set is used to fit the model, so it receives the larger share of the data to give a reliable fit; once the model is complete, the testing set is used for an unbiased evaluation.
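A minimal sketch of the 70/30 split, assuming the engineered data frame `bank`:

```r
set.seed(123)                                    # for reproducibility
n          <- nrow(bank)
train_idx  <- sample(seq_len(n), size = round(0.7 * n))
train_data <- bank[train_idx, ]
test_data  <- bank[-train_idx, ]
```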
The next objective is to find the optimal cut-off probability. We consider several candidate values, in this case 20 of them, and use 5-fold cross-validation on the training set to identify the cut-off point with the highest accuracy; the results above show the optimal value.
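A sketch of that search, assuming the 70/30 split above and the final model formula; the candidate grid and fold assignment here are illustrative.

```r
cutoffs <- seq(0.05, 0.95, length.out = 20)        # 20 candidate cut-off probabilities
folds   <- sample(rep(1:5, length.out = nrow(train_data)))

acc_by_fold <- sapply(1:5, function(k) {
  fit   <- glm(formula(final_model), data = train_data[folds != k, ], family = binomial)
  prob  <- predict(fit, newdata = train_data[folds == k, ], type = "response")
  truth <- train_data$y[folds == k]
  # accuracy at every candidate cut-off for this held-out fold
  sapply(cutoffs, function(p) mean(ifelse(prob > p, "yes", "no") == truth))
})

cv_accuracy <- rowMeans(acc_by_fold)               # mean CV accuracy per cut-off
best_cutoff <- cutoffs[which.max(cv_accuracy)]
```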
| test.accuracy |
|---|
| 0.8965739 |
Here we test the model built from the training set using the test set, the group of random observations set aside earlier. To do this, the test observations are plugged into the fitted model and the accuracy is computed as the number of correctly predicted observations divided by the total number of test observations. Because the test accuracy was about the same as the highest accuracy found on the training data, the model does not appear to be overfitting. Furthermore, the accuracy itself is high, indicating that the model is doing a fairly good job of predicting the outcome.
| sensitivity | specificity | precision | recall | F1 |
|---|---|---|---|---|
| 0.3652459 | 0.9708498 | 0.6365714 | 0.3652459 | 0.4641667 |
The above table provides values used to judge how the model is performing. Sensitivity is the proportion of customers who actually subscribed that the model correctly predicted to subscribe; sensitivity is the same as recall, which explains their identical values. Specificity is the proportion of customers who did not subscribe that the model correctly predicted not to subscribe. The low sensitivity makes the model look poor, but it is not a fair way to judge performance here because of the imbalance between subscribers and non-subscribers in the data: over 85% of the observations are customers who did not subscribe, which makes it difficult for the model to learn to identify the customers who truly will subscribe, and it also explains why the specificity is so high. A better measure for this data set is precision, the proportion of customers predicted to subscribe who actually did subscribe. Finally, the F1 score combines precision and recall by taking their harmonic mean, and it is a popular performance measure because it balances those two values. That said, the F1 value shown above does not indicate a strong performance from the model.
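A sketch of computing these metrics from a confusion matrix on the test set, assuming the tuned cut-off `best_cutoff` and a model refit on the training data:

```r
fit_train <- glm(formula(final_model), data = train_data, family = binomial)
prob_test <- predict(fit_train, newdata = test_data, type = "response")
pred_test <- factor(ifelse(prob_test > best_cutoff, "yes", "no"), levels = c("no", "yes"))

tab <- table(predicted = pred_test, actual = test_data$y)
TP <- tab["yes", "yes"]; TN <- tab["no", "no"]
FP <- tab["yes", "no"];  FN <- tab["no", "yes"]

sensitivity <- TP / (TP + FN)     # same as recall
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
F1 <- 2 * precision * sensitivity / (precision + sensitivity)   # harmonic mean
```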
One last way to measure the performance of a model, and a good way to compare one model against another (which will be useful later when we have more models), is the receiver operating characteristic (ROC) curve. The graph above plots sensitivity against 1 minus specificity, which correspond to the true positive rate and the false positive rate. These values come not from the test set like the local metrics calculated before, but from the training set, because the ROC curve is used to compare the model against others and therefore needs the global measurements. The area under the curve (AUC) summarizes how well the model separates subscribers from non-subscribers. The curve sits well above the dotted diagonal, which represents a 50-50 chance, telling us that the model is doing well; but again, the ROC curve and AUC are mainly used for comparing competing models, which is what we will do later in the paper.
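A sketch of producing the ROC curve and AUC with the `pROC` package, using the training-set probabilities as described above:

```r
library(pROC)

roc_obj <- roc(response  = train_data$y,
               predictor = predict(fit_train, type = "response"),
               levels = c("no", "yes"))

plot(roc_obj, legacy.axes = TRUE)   # x-axis drawn as 1 - specificity
auc(roc_obj)
```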
The neural network method is another way to build a model for the bank marketing data set. It uses forward propagation and backward propagation, adjusting weights until the model's error is minimized. To start this process, the numeric features need to be scaled and the categorical features need to be converted into dummy variables, because the method we are using accepts only numeric inputs. Scaling puts the numeric features on a comparable range, and each dummy variable created from a categorical feature is renamed and placed into a simple model formula, as seen below.
## y ~ JobBus + JobUnkUnemp + Edu2 + Edu3 + EduUnk + MonSpr + MonSumm +
## MonWint + pd0 + cmpn2 + cmpn3 + cmpn4to63 + pre1plus + marMar +
## marSin + defYes + housyes + loanyes + contTel + contUnk +
## poutOther + poutSuc + poutUnk + age + day + durMin + balance
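A sketch of that preparation, assuming the engineered training data frame `train_data`; `model.matrix()` creates the 0/1 dummy columns and `scale()` standardizes the numeric ones (the dummy names produced here will differ slightly from the shortened names in the formula above).

```r
num_cols <- c("trans_age", "day", "dur_min", "new_bal")
scaled   <- scale(train_data[, num_cols])                       # centre and standardise

dummies <- model.matrix(~ grp_job + edu + grp_mon + grp_cmpn + grp_pre + mar +
                          def + hous + loan + cont + pout,
                        data = train_data)[, -1]                # drop the intercept column

nn_data <- data.frame(dummies, scaled,
                      y = as.numeric(train_data$y == "yes"))    # 0/1 target
```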
| Quantity / Weight | Value |
|---|---|
| error | 475.7406723 |
| reached.threshold | 0.0099814 |
| steps | 22853.0000000 |
| Intercept.to.1layhid1 | 7.4072468 |
| JobBus.to.1layhid1 | -0.1579781 |
| JobUnkUnemp.to.1layhid1 | -0.5988646 |
| Edu2.to.1layhid1 | -0.1124455 |
| Edu3.to.1layhid1 | -0.2588304 |
| EduUnk.to.1layhid1 | -0.2442335 |
| MonSpr.to.1layhid1 | -0.1263705 |
| MonSumm.to.1layhid1 | 0.2841423 |
| MonWint.to.1layhid1 | 0.3840472 |
| pd0.to.1layhid1 | -0.0003824 |
| cmpn2.to.1layhid1 | 0.3661122 |
| cmpn3.to.1layhid1 | 0.3508113 |
| cmpn4to63.to.1layhid1 | 0.4630550 |
| pre1plus.to.1layhid1 | -1.9396286 |
| marMar.to.1layhid1 | 0.0545085 |
| marSin.to.1layhid1 | -0.2318035 |
| defYes.to.1layhid1 | 0.4201786 |
| housyes.to.1layhid1 | 0.9710037 |
| loanyes.to.1layhid1 | 0.2793698 |
| contTel.to.1layhid1 | -0.2454108 |
| contUnk.to.1layhid1 | 1.0500467 |
| poutOther.to.1layhid1 | -0.7210009 |
| poutSuc.to.1layhid1 | -3.2451069 |
| poutUnk.to.1layhid1 | -2.0854551 |
| age.to.1layhid1 | -0.1261020 |
| day.to.1layhid1 | 0.3751998 |
| durMin.to.1layhid1 | -10.5611026 |
| balance.to.1layhid1 | -1.2701069 |
| Intercept.to.y | 0.7727309 |
| 1layhid1.to.y | -0.7804462 |
Single Layer Neural Network Model
By running this simple model through back-propagation, we obtain its associated weights and can build the neural network model. The next step after building the neural network model is to use cross-validation to find the optimal cut-off score.
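A minimal sketch of fitting the single-hidden-neuron network with the `neuralnet` package, assuming the prepared data frame `nn_data` from above:

```r
library(neuralnet)

predictors <- setdiff(names(nn_data), "y")
nn_formula <- as.formula(paste("y ~", paste(predictors, collapse = " + ")))

nn_fit <- neuralnet(nn_formula, data = nn_data,
                    hidden = 1,              # one hidden layer with a single neuron
                    linear.output = FALSE)   # logistic activation for a 0/1 target

nn_fit$result.matrix                         # error, steps, and the fitted weights
plot(nn_fit)                                 # network diagram like the one shown above
```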
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -6.462 | 1.036 | -6.235 | 4.528e-10 |
| grp_jobBus. | 0.152 | 0.045 | 3.378 | 0.0007303 |
| grp_jobUnk./Unemp. | 0.5324 | 0.05701 | 9.338 | 9.778e-21 |
| edu secondary | 0.2251 | 0.06154 | 3.658 | 0.0002547 |
| edu tertiary | 0.4154 | 0.06877 | 6.041 | 1.531e-09 |
| edu unknown | 0.291 | 0.1006 | 2.892 | 0.003829 |
| grp_monSpri | 0.08009 | 0.05819 | 1.376 | 0.1687 |
| grp_monSumm | -0.2744 | 0.05691 | -4.822 | 1.421e-06 |
| grp_monWint | -0.2553 | 0.06916 | -3.691 | 0.0002234 |
| grp_pd1+ | 1.489 | 1.024 | 1.454 | 0.146 |
| grp_cmpn2 | -0.3787 | 0.04434 | -8.539 | 1.352e-17 |
| grp_cmpn3 | -0.308 | 0.05986 | -5.145 | 2.67e-07 |
| grp_cmpn4-63 | -0.5047 | 0.05679 | -8.887 | 6.27e-19 |
| mar married | -0.1985 | 0.05982 | -3.319 | 0.0009034 |
| mar single | 0.06438 | 0.06874 | 0.9367 | 0.3489 |
| def yes | -0.2418 | 0.2355 | -1.027 | 0.3045 |
| hous. yes | -0.9128 | 0.04305 | -21.21 | 8.581e-100 |
| loan yes | -0.5009 | 0.06187 | -8.096 | 5.675e-16 |
| cont telephone | -0.01362 | 0.07436 | -0.1832 | 0.8546 |
| cont unknown | -1.16 | 0.05921 | -19.59 | 2.022e-85 |
| pout other | 0.2874 | 0.09038 | 3.18 | 0.001472 |
| pout success | 2.349 | 0.08106 | 28.98 | 1.317e-184 |
| pout unknown | 1.252 | 1.025 | 1.221 | 0.222 |
| trans_age.scaled | -0.2374 | 0.1359 | -1.747 | 0.08068 |
| day.scaled | -0.1712 | 0.0672 | -2.548 | 0.01083 |
| dur_min.scaled | 9.428 | 0.1431 | 65.88 | 0 |
| new_bal.scaled | 1.052 | 0.1572 | 6.694 | 2.167e-11 |
Neural Network Feature Weights
As seen in the 5-fold cross-validation performance graph above, there are several suggested optimal cut-off points. In our case, we use the mean of the reported values. Now that we have an optimal cut-off score, we can test the model's performance through its accuracy rate.
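A sketch of scoring the network on the test set, assuming the test features were prepared the same way (`nn_test`) and that the cross-validation cut-offs are stored in `cv_cutoffs`:

```r
nn_cutoff <- mean(cv_cutoffs)                           # mean of the suggested cut-offs

nn_out  <- neuralnet::compute(nn_fit, nn_test[, predictors])
nn_prob <- nn_out$net.result                            # predicted probabilities
nn_pred <- ifelse(nn_prob > nn_cutoff, 1, 0)

mean(nn_pred == nn_test$y)                              # test accuracy
```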
| accuracy |
|---|
| 0.9912447 |
The test accuracy of about 99% suggests the model is performing very well, although a percentage this high, given the imbalance between customers who subscribed and those who did not, can also be a warning sign that the model is not learning to predict the subscribing customers well.
Here we have the ROC curve obtained using the optimal cut-off score, plotting sensitivity against 1 minus specificity. The curve sits above the 50-50 line, meaning the model still performs better than a coin flip. Now that we have the ROC curve for the neural network model, we can compare it against the other models as well.