From the bar plot taken from the EDA (Anya’s part) we can see that the number of clients who did not subscribe to a term deposit (36,202 people) is much bigger than the number of those who did (4,639 people). This means that my predictions for subscribers might not be precise enough, because the sample contains very few observations of subscribers.
First, we need to understand the contribution of each variable to the prediction of the response. On the chart below, we can see that duration is the variable most strongly related to the response. The model also identified balance, age, day, and poutcome as the most important variables for predicting the response. Pdays, campaign, housing, previous, and month are less important, but I will still look at the reasons why they could appear in the graph.
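To make the imbalance concrete, here is a minimal Python sketch that uses only the two counts quoted above. The variable names are my own, and the "always no" baseline is an illustration, not one of the models in this report.

```python
# Counts quoted in the text above (from the EDA bar plot).
non_subscribers = 36_202
subscribers = 4_639

total = non_subscribers + subscribers
subscriber_share = subscribers / total

# A hypothetical model that always answers "no" would already be right
# this often on the full data, so accuracy alone says little about
# how well subscribers are predicted.
always_no_accuracy = non_subscribers / total

print(f"total clients:        {total}")
print(f"subscriber share:     {subscriber_share:.1%}")
print(f"always-'no' accuracy: {always_no_accuracy:.1%}")
```

Any accuracy reported later is worth comparing against this baseline.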
(Source: https://uc-r.github.io/lime)
From the heatmap below we can observe which features are the most important for predicting the response, using the first 10 observations as an example. Almost all of these variables were already shown on the previous graph, and duration is again the most influential one. Compared to the chart above, the heatmap adds the marital variable.
(Source: https://uc-r.github.io/lime)
I suggest looking at Anya’s EDA again to understand why the model chose exactly these variables.
Duration
The duration variable turned out to be problematic. The boxplot below shows the last contact duration (in seconds) for the clients who subscribed to a term deposit and for those who did not. Unfortunately, we cannot know the type of these phone calls. I suppose they were personal calls from a bank worker to a client with the goal of convincing him/her to sign our term deposit. Thus, I think that the longer the call was, the higher the chance that the worker succeeded. Indeed, we can observe this pattern on the boxplot. However, the description of the dataset says that even though duration highly affects the output, it should not be included in the predictive model if the model is to be realistic. The reasons are 1) the duration of a call is not known before the call itself, so we cannot use it for prediction, and 2) after the call the response is obviously known (Source: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#).
Balance
From the next boxplot we can see that there is a small difference in the balance levels of customers who signed our term deposit and those who did not. For those who did, the balance is a little higher, meaning that the less money a person has, the less likely he/she is to subscribe to a new deposit. This is also shown on the heatmap.
Age
From the boxplot below we can see that there is almost no difference in age between clients who subscribed to our term deposit and those who did not. Their overall mean age is about 40 years, and that is our target age group. This explains why one of the most important features on the heatmap was age from 31 to 50.
Day
The boxplot below shows that there is no difference between clients who subscribed to our term deposit and those who did not in the day of the month when they were last contacted. For both groups the mean day is 15, which can be explained by the fact that the contact between the bank and a client can happen on any random day of the month. As there are 30-31 days in each month, the mean day is the 15th. It seems that this variable cannot really predict the response.
Poutcome
For me it was obvious that if a client agreed during the previous marketing campaign, he/she is more likely to agree during the new one too, and the data support this proposition. From the barplot below we can see that more than 60% of the clients who answered ‘yes’ to the previous campaign responded ‘yes’ to signing our term deposit. Among those who disagreed last time, the share of positive responses is only about 12%. That is why on the Variable Importance Visualization graph only a successful response to the previous campaign was considered important for predicting the response to this campaign.
Pdays
The boxplot below illustrates the number of days that passed since a client was last contacted in a previous campaign, for both subscribers and non-subscribers. There is a difference of about 30 days between the groups, and maybe this difference influenced the response.
Campaign
The following histogram shows that most customers were contacted about 2-3 times during this campaign. Also, the EDA results (Anya’s part) showed that the mean number of contacts was almost the same for subscribers and non-subscribers, with the former getting slightly more attention from our bank. That is probably the only influence this variable could have on the response.
Housing
The housing variable also seems logical to me. If a client already has a loan, he/she is less likely to sign a term deposit because he/she has to repay the loan. The barplot below shows this relationship. Only about 7% of our clients who have a housing loan responded ‘yes’ to our campaign, while for those who do not have the loan this number rises to almost 20%. On the Variable Importance Visualization graph we can also see that having a housing loan highly influences the response.
Previous
The following histogram illustrates the number of contacts that were performed for a client before our campaign. Specifically, we can observe that most of our clients had never been contacted before. Almost all of them did not sign our term deposit. However, from the EDA results (Anya’s part) we know that subscribers were contacted more often during the previous campaign, and maybe that also influenced their decision.
Month
From the barplot below we can see that 4 months were much more successful than the others: March, September, October, and December, when about 50% of our clients in each of these months signed our term deposit.
I tried to understand the reasons for such a big difference, and it turned out that in these 4 months the number of clients was very low, about ten times lower than in the other months. That is why the shares on the barplot above are so big. I believe that this variable appeared in the Variable Importance Visualization chart only for this reason and cannot help to predict the response correctly.
## apr aug dec feb jan jul jun mar may nov oct sep
## 2529 5877 173 2258 1183 6520 4853 407 12496 3483 605 457
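The "about ten times lower" claim can be checked directly from the counts printed above. The following is a small Python sketch of that arithmetic; the dictionary and variable names are my own.

```python
# Monthly client counts copied from the table above.
month_counts = {
    "apr": 2529, "aug": 5877, "dec": 173, "feb": 2258,
    "jan": 1183, "jul": 6520, "jun": 4853, "mar": 407,
    "may": 12496, "nov": 3483, "oct": 605, "sep": 457,
}
high_share_months = ["mar", "sep", "oct", "dec"]

total = sum(month_counts.values())
small_total = sum(month_counts[m] for m in high_share_months)

# Average number of clients per month in each group.
avg_small = small_total / len(high_share_months)
avg_other = (total - small_total) / (len(month_counts) - len(high_share_months))
ratio = avg_other / avg_small

print(f"clients in the four months: {small_total} ({small_total / total:.1%} of all)")
print(f"average per month: {avg_small:.0f} vs {avg_other:.0f} (ratio {ratio:.1f})")
```

The four months together hold only about 4% of all clients, so their high subscription shares rest on few observations.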
Marital
The last barplot below shows the difference between the shares of subscribers and non-subscribers according to their marital status. We can see that single customers are more likely to subscribe to our term deposit, while married and divorced clients are less likely to. The difference in shares is not that big, and on the heatmap both married and single status were identified as important features.
First, I built a decision tree based on the variables that both graphs above identified as important for predicting our customers’ response: duration, balance, age, day, poutcome, pdays, campaign, housing, previous, month, and marital. I decided to keep the duration variable because without it the decision tree relies only on poutcome and is not very useful. Interestingly, if we delete both poutcome and duration from the model, the decision tree cannot be built at all. So, at least one of those variables should be present in the formula.
So, after my first try the tree was built on only two variables: duration and poutcome. Let’s first understand what is happening in the decision tree itself and then decide whether it is representative.
Interpretation
The root node (the highest one) indicates that the overall probability of a customer subscribing to our term deposit is 0.11, or 11%, which agrees with the share of subscribers in the data; the node is therefore labelled ‘no’. In every node the number shown is the probability of subscribing, also in the nodes where the predicted class is ‘no’. This node asks whether the duration of the last call with a client was shorter than 8.2 seconds or not. If yes, the probability of subscribing drops to 7%. 88% of our customers had a last call lasting less than 8.2 seconds. Then the node asks about the outcome of the previous marketing campaign. 85% of our clients responded ‘no’ to the previous campaign or their response is unknown. Their probability of signing a term deposit is 5%. Those people who answered ‘yes’ to the previous campaign (3% of all customers) have a probability of signing equal to 61%. Among these clients, those whose last call lasted less than 2.7 seconds have a probability of signing of 31%, and the others have a probability of signing equal to 74%.
For the customers whose last call was longer than 8.2 seconds, the probability of signing a term deposit is 43%. Overall, 12% of our clients belong to this group. Then the node again asks about the duration of the call. 8% of the customers had a call of less than 14 seconds and 4% had a longer one. The former have a probability of signing a term deposit equal to 35% and the latter have a probability of signing equal to 59%. All of those 8% of clients responded ‘no’ to the previous marketing campaign or their response is unknown, and their probability of signing a term deposit is 32%.
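One way to check how the numbers in the nodes should be read is to see whether a parent's value equals the share-weighted average of its children. The Python sketch below does this with the rounded figures quoted in the interpretation; the helper `weighted` is my own, and the check assumes that each figure is the probability of subscribing within the node.

```python
def weighted(children):
    """Share-weighted average of (share, probability) pairs."""
    total_share = sum(share for share, _ in children)
    return sum(share * prob for share, prob in children) / total_share

# Root split: 88% of customers at 0.07 and 12% at 0.43 (root shows 0.11).
root_from_children = weighted([(0.88, 0.07), (0.12, 0.43)])

# Long-call branch: 8% at 0.35 and 4% at 0.59 (parent shows 0.43).
long_from_children = weighted([(0.08, 0.35), (0.04, 0.59)])

# Alternative reading: 0.35 is the probability of NOT subscribing,
# so the probability of subscribing in that node would be 0.65.
long_alternative = weighted([(0.08, 1 - 0.35), (0.04, 0.59)])

print(f"root from children:        {root_from_children:.3f}")
print(f"long-call parent:          {long_from_children:.3f}")
print(f"long-call parent (other):  {long_alternative:.3f}")
```

The children average back to the parent values (about 0.11 and 0.43) only when every figure is read as the probability of subscribing; under the alternative reading the long-call parent would be about 0.63.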
Evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7138 554
## yes 186 290
##
## Accuracy : 0.9094
## 95% CI : (0.903, 0.9155)
## No Information Rate : 0.8967
## P-Value [Acc > NIR] : 6.426e-05
##
## Kappa : 0.3943
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9746
## Specificity : 0.3436
## Pos Pred Value : 0.9280
## Neg Pred Value : 0.6092
## Prevalence : 0.8967
## Detection Rate : 0.8739
## Detection Prevalence : 0.9417
## Balanced Accuracy : 0.6591
##
## 'Positive' Class : no
##
From the confusion matrix, we can see the numbers of true and false negatives and positives. The number of clients who were predicted to subscribe to our term deposit and did so (290 people) is bigger than the number of those who were predicted to subscribe but did not (186). The number of clients who were correctly predicted to be non-subscribers (7138) is bigger than the number of those who were incorrectly predicted to be non-subscribers (554). We can also see the accuracy of our model, which is about 91%. From these numbers we can conclude that our bank works quite efficiently. There are more correct predictions than incorrect ones, which means that we gain more money than we spend. Overall, we predicted that 476 customers would subscribe to our term deposit and only 186 of them did not. It means that we spent money on all 476 customers but gained it back from 290 clients. For the clients who were predicted to be non-subscribers the situation is even better. 7692 clients were predicted not to subscribe to a deposit, and about 93% of them really did not. Thus, we did not spend money on them and we did not lose anything. Besides, we gained some money from 554 clients who subscribed to a term deposit even though we did not expect it. At the same time, the model found only 290 of the 844 actual subscribers (about 34%), and its accuracy is only slightly above the no-information rate of 0.8967.
On the ROC curve, the grey line represents the relationship between TPR and FPR for our model and the dashed line represents a model with no predictive value. In plain words, to understand how good our predictive model is, we should measure the area under the grey curve (AUC). Ideally, it should be close to 1; for a model with no predictive value (the dashed line) it is about 0.5 (Source: http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/13_ModelEvaluation.html).
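The statistics in the output above can be recomputed from the four cells of the printed confusion matrix. The following Python sketch does that by hand (the variable names are my own); ‘no’ is treated as the positive class, as stated in the output.

```python
# Cells of the printed matrix: rows are predictions, columns are the reference.
pred_no_true_no, pred_no_true_yes = 7138, 554
pred_yes_true_no, pred_yes_true_yes = 186, 290

total = pred_no_true_no + pred_no_true_yes + pred_yes_true_no + pred_yes_true_yes
true_no = pred_no_true_no + pred_yes_true_no      # actual non-subscribers
true_yes = pred_no_true_yes + pred_yes_true_yes   # actual subscribers
pred_no = pred_no_true_no + pred_no_true_yes      # predicted non-subscribers
pred_yes = pred_yes_true_no + pred_yes_true_yes   # predicted subscribers

accuracy = (pred_no_true_no + pred_yes_true_yes) / total
no_information_rate = max(true_no, true_yes) / total
sensitivity = pred_no_true_no / true_no      # share of actual 'no' found
specificity = pred_yes_true_yes / true_yes   # share of actual 'yes' found
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the marginals give by chance.
expected = (pred_no * true_no + pred_yes * true_yes) / total**2
kappa = (accuracy - expected) / (1 - expected)

for name, value in [("accuracy", accuracy), ("NIR", no_information_rate),
                    ("sensitivity", sensitivity), ("specificity", specificity),
                    ("balanced accuracy", balanced_accuracy), ("kappa", kappa)]:
    print(f"{name:18s} {value:.4f}")
```

The specificity line is the one to watch here: it is the share of actual subscribers that the tree identifies.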
## [[1]]
## [1] 0.7523582
I got 0.75, which is acceptable and means that our model separates subscribers from non-subscribers reasonably well.
Even though the model considered poutcome to be a significant predictor, I tried deleting it from the formula to see whether there are other variables that are less significant but can still predict whether our client will subscribe to a term deposit or not. The results are presented below.
Now, because I deleted one of the most significant variables, other, less significant ones appeared in our decision tree.
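For intuition about what this number measures: the area under the ROC curve equals the share of (subscriber, non-subscriber) pairs in which the model gives the subscriber the higher score. Below is a minimal Python sketch of that definition. The function `auc` and the toy labels and scores are made up for illustration and are not taken from our data or from the package used in this report.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical toy example: 1 = subscribed, 0 = did not subscribe.
toy_labels = [0, 0, 1, 1, 1]
toy_scores = [0.1, 0.6, 0.35, 0.8, 0.9]
toy_auc = auc(toy_labels, toy_scores)

print(f"toy AUC: {toy_auc:.3f}")
```

A value of 1 means every subscriber is ranked above every non-subscriber, and 0.5 corresponds to random ranking.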
Interpretation
The root node (the highest one) still indicates that the overall probability of a customer subscribing to our term deposit is 0.11, or 11%. As before, the number in every node is the probability of subscribing. This node asks whether the duration of the last call with a client was shorter than 8.2 seconds or not. If yes, the probability of subscribing drops to 7%. 88% of our customers had a last call lasting less than 8.2 seconds. Then the node asks about the month when the last contact with a customer took place. If it was in March, September, October, or December, the chance that a customer will sign our term deposit rises to 45%. Only 3% of our clients belong to this group. Among them, people whose last call lasted more than 2.2 seconds have a probability of signing the term deposit equal to 58%; for the others the probability of signing is 16%. Those people who had their last contact in other months (84% of all customers) have a probability of signing a term deposit equal to 5%.
For the customers whose last call was longer than 8.2 seconds, the probability of signing a term deposit is 43%. Overall, 12% of our clients belong to this group. Then the node again asks about the duration of the call. 4% of the customers had a call of more than 14 seconds and 8% had a shorter one. The former have a probability of signing a term deposit equal to 59% and the latter have a probability of signing equal to 35%. Among the latter, for those who have never been contacted by our bank (7% of all) the probability of signing a term deposit is 32%. Those who have been contacted at least once (only 1% of customers) have a probability of signing equal to 54%. All of the remaining 1% of clients have a housing loan, and their probability of signing a term deposit is 40%.
Evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7112 576
## yes 212 268
##
## Accuracy : 0.9035
## 95% CI : (0.8969, 0.9098)
## No Information Rate : 0.8967
## P-Value [Acc > NIR] : 0.02104
##
## Kappa : 0.3566
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9711
## Specificity : 0.3175
## Pos Pred Value : 0.9251
## Neg Pred Value : 0.5583
## Prevalence : 0.8967
## Detection Rate : 0.8707
## Detection Prevalence : 0.9412
## Balanced Accuracy : 0.6443
##
## 'Positive' Class : no
##
The confusion matrix did not change a lot. The number of clients who were predicted to sign our term deposit and did so became 268 people (compared to 290 people in Decision Tree 1). The number of those who were predicted to sign but did not became 212 (compared to 186). The number of clients who were correctly predicted to be non-subscribers is now 7112 (compared to 7138). And the number of those who were incorrectly predicted to be non-subscribers is 576 people (compared to 554). The accuracy is still about 90%. As our proportions did not change much, the conclusions in terms of gains and losses are the same as for Decision Tree 1.
There is no observable difference in the ROC curve compared to the previous model, so we will again measure the area under the grey curve.
## [[1]]
## [1] 0.7445086
It is equal to about 0.74, almost the same as in the previous model (0.75), so our second model is also acceptable.
Even though many more variables appeared to influence the response at the variable importance stage, only 5 of them appeared in the decision trees: duration, poutcome, pdays, housing, and month. However, I classified duration and month as not representative and I do not think that they can actually predict the response. Thus, I will formulate my hypotheses based on the other variables that were used in the decision trees.
Now we will test our hypotheses with a Bayesian network.
The automatically built network was not quite correct, as the arrows should go to the response, not from it. So, I changed it a little bit.
Interpretation
The Bayesian network shows the relationships between these variables. As I have already explained the relationships between housing and response, pdays and response, and poutcome and response at the first stage, I will not focus on them again.
Evaluation
The accuracy of our Bayesian network is about 90%. That is roughly the same as for our decision trees.
## [1] 0.89655
The overall probability of a client not subscribing to our term deposit according to our Bayesian network is about 72%. Pretty high!
## [1] 0.7189
Now we will look at the conditional probability tables (CPTs), starting with the relationship between poutcome and response.
Poutcome
## response
## poutcome no yes
## failure 0.5505258 0.4494742
## success 0.4862907 0.5137093
## unknown 0.8635499 0.1364501
The first CPT confirms my hypothesis about poutcome. Clients who disagreed during the previous marketing campaign will not subscribe to our term deposit with a probability of 55%. Clients who agreed during the previous campaign will subscribe to a term deposit with a probability of 51%. As for the clients whose responses to the previous campaign are unknown, they are also more likely to be non-subscribers (86% probability).
Housing
## response
## housing no yes
## no 0.7955616 0.2044384
## yes 0.8304874 0.1695126
My hypothesis about the housing variable is also supported. The customers who already have a housing loan in our bank will not subscribe to a term deposit with a probability of 83%. Those who do not have one are also more likely to be non-subscribers, with a probability of 80%. There is a small difference between the decisions of the clients from these two groups, but unfortunately both are more likely not to subscribe to our term deposit.
Pdays
## response
## pdays no yes
## (-2,0] 0.8491958 0.15080422
## (0,100] 0.2846891 0.71531089
## (100,200] 0.5387544 0.46124558
## (200,300] 0.5413853 0.45861466
## (300,871] 0.9621661 0.03783387
My last hypothesis is also supported by the network. If a client has never been contacted by our bank, the probability that he/she will not subscribe to a term deposit is about 85%. Those who were contacted recently (at most 100 days ago) have a probability of subscribing to a term deposit equal to 72%. Other clients are less likely to be subscribers; their probability of not subscribing to a term deposit ranges from 54% to 96%.
The number of days since the last contact with the customer was divided in this way mainly to separate the clients who have never been contacted, (-2,0], from those who were contacted at least once, (0,871]. From the summary below we can see that the number of clients who have never been contacted is much bigger than the number of those who were. I at least tried to make the other groups roughly equal in size.
## (-2,0] (0,100] (100,200] (200,300] (300,871]
## 34797 1037 2188 1063 1756
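The grouping above uses intervals that are open on the left and closed on the right. Below is a minimal Python sketch of how a single pdays value falls into these groups. The function `pdays_bin` is hypothetical and only mirrors the interval labels in the table; it also assumes that clients who were never contacted are coded with pdays = -1, as in the dataset description.

```python
from bisect import bisect_left

# Interval edges taken from the table above: (-2,0], (0,100], ..., (300,871].
EDGES = [-2, 0, 100, 200, 300, 871]

def pdays_bin(pdays):
    """Return the label of the left-open, right-closed interval for pdays."""
    i = bisect_left(EDGES, pdays)
    if i == 0 or i == len(EDGES):
        raise ValueError(f"pdays={pdays} is outside (-2, 871]")
    return f"({EDGES[i - 1]},{EDGES[i]}]"

# Group sizes copied from the summary above.
bin_counts = {"(-2,0]": 34797, "(0,100]": 1037, "(100,200]": 2188,
              "(200,300]": 1063, "(300,871]": 1756}
never_contacted_share = bin_counts["(-2,0]"] / sum(bin_counts.values())

for value in (-1, 100, 101, 871):
    print(value, "->", pdays_bin(value))
print(f"never contacted: {never_contacted_share:.1%}")
```

Note that a value of exactly 100 days belongs to (0,100], not to (100,200].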
P.S.
Frankly speaking, I expected more variables to influence the customers’ decision to subscribe to a term deposit or not. That is why I also tried to build another Bayesian network based on the other variables that were considered important for the response at the stage of choosing variables for the predictive model and that seemed to be logical predictors. I put balance, age, campaign, and marital in the network, but I did not get any significant results in the CPTs. The percentages of ‘yes’ and ‘no’ responses barely changed between different balance levels, ages, marital statuses, and numbers of contacts for a client.
Both our Decision Trees and the Bayesian Network showed about the same accuracy and predicted the response quite well. Unfortunately, even though Decision Tree 1 showed appropriate statistics, I would not consider it reliable for predicting our subscribers and non-subscribers because it is fully based on the two most obvious predictors of the response. Decision Tree 2 is also quite biased because of the month variable, which influences the response only because of the very small number of clients in 4 out of 12 months. But at least it has other variables that could really influence the decision of our customers to subscribe to a term deposit or not.
As our Bayesian network was built only on the variables that were included in the Decision Trees (except duration and month), it predicted the response in the same way as they did. In both models a successful outcome of the previous campaign predicted subscribing to a term deposit. At the same time, a failure or an unknown result in the previous campaign was associated with not subscribing. If a customer has never been contacted by our bank, in both models he/she had a greater probability of not subscribing to a term deposit. On the contrary, those who were contacted by the bank at least once were more likely to subscribe. The clients who have a housing loan were predicted by both models to be less likely to become subscribers. Those who do not have a housing loan had greater probabilities of subscribing to a term deposit in both models.
As other variables did not show any significant influence on the response, I can conclude that both models succeeded in predicting the results of the marketing campaign of our bank.
Conclusion
After all the analysis, I would recommend that our bank take care of clients who already have loans (housing or personal) and offer them attractive deposits with more benefits. Other suggestions and policies that we would implement can be found in the “Subscription Improvement Policy” part made by Nadya.