Final Business Analytical Report by BM Corporation
Our company, “BMC”, is glad to meet you! Thank you for choosing us to analyze your marketing database. We hope to find interesting insights and deliver accurate, implementable results. So, let’s skip the pleasantries and get into the data. But first… a short intro :)
Our team “Busy Meatballs Corporation” consists of three highly qualified third-year data analysts:
Now we are ready to start!
First, we would like to show you who your client is. To do that, we looked at the socio-demographic data. Here are the results (tables and graphs are placed below):
Marital status: About 60% of your clients are married. The rest are split between single (28%) and divorced (12%) clients;
Education: More than half of the clients (53.7%) have secondary education, 30% have higher (tertiary) education, and 16% have primary education.
Job position: First of all, 88% of your clients are employed. In more detail, the most widespread job positions among clients are blue-collar (22%), management (21%) and technician (17%), followed by administrative (12%) and services (9%) positions. Each of the remaining positions accounts for less than 5% of observations and can be seen in the table below.
Age: In general, the age of your clients varies from 18 to 95, with an average of 40. The graph shows a first increase from age 25 and a peak at 30-35, followed by a decrease. According to the descriptive statistics, 50% of your clients are between 33 and 48 years old.
Balance: The graph shows that the most frequent balances range from -500 to 1000-1500 (each bar on the graph is 500 wide). The descriptive statistics say that 50% of the observations lie between 64 and 1333. The average client balance is about 1000. The minimum and maximum values are -6847 and 10443.
   marital    count  perc
1  married    24641  60.33%
2  single     11443  28.02%
3  divorced    4757  11.65%

   education  count  perc
1  secondary  21933  53.7%
2  tertiary   12380  30.31%
3  primary     6528  15.98%

   employed   count  perc
1  yes        35771  87.59%
2  no          5070  12.41%

    job            count  perc
 1  blue-collar     8805  21.56%
 2  management      8565  20.97%
 3  technician      6954  17.03%
 4  admin.          4705  11.52%
 5  services        3801  9.31%
 6  retired         2020  4.95%
 7  self-employed   1443  3.53%
 8  entrepreneur    1340  3.28%
 9  unemployed      1212  2.97%
10  housemaid       1149  2.81%
11  student          689  1.69%
12  other            158  0.39%
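For readers who want to check the percentage columns, here is a minimal sketch of how the shares follow from the counts. It uses the marital-status counts from the table above; the function name `shares` is our own choice for illustration and is not part of the original analysis.

```python
# Counts copied from the marital-status table in this report.
marital_counts = {"married": 24641, "single": 11443, "divorced": 4757}


def shares(counts):
    """Return each category's share of the total, in percent (2 decimals)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


marital_shares = shares(marital_counts)
for name, pct in marital_shares.items():
    print(f"{name:<10}{pct:>6.2f}%")
```

The same helper can be applied to the education, employment and job counts.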
Now let’s have a look at the overall bank-client situation:
Now, let’s compare the results of the previous marketing campaign and the current one (whether or not the client subscribed to a term deposit).
Unfortunately, we do not have enough information about the results of the previous marketing campaign: a large share of the outcomes is unknown. But we do know that 3% of all cases in the previous campaign were successful and 11% were not (almost 4 times more). As for the present marketing campaign, 11% of clients gave a positive response and subscribed to a term deposit. Not the best result, BUT it’s not as bad as it could be: compared with the number of clients who have credit in default, or with the known ‘successful’ outcomes of the previous campaign, 11% of positive responses is pretty good.
Okay. Now that we know who the client is and what the results of both marketing campaigns are, let’s look at the factors that could be related to the responses to the present campaign. For that I will draw graphs to visualize the proportions of clients who did and did not subscribe to a term deposit. I will also mention whether the differences between the two groups of clients were significant. To assess this I used statistical tests (chi-square tests and t-tests). I will not add tables with calculations and residuals here, but if you are interested, you can explore them via the link: http://rpubs.com/Nuta/part1.
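To give an idea of what a chi-square test of independence does, here is a minimal sketch in plain Python. The counts in `example_table` are HYPOTHETICAL and invented purely for illustration; they are not taken from the bank data, and the actual calculations are the ones behind the link above. The sketch also applies no continuity correction, so statistical software that applies one to 2x2 tables by default may give a slightly different value.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


# HYPOTHETICAL counts (not from the report):
# rows = two client groups, columns = did not subscribe / subscribed.
example_table = [[90, 10], [80, 20]]
stat = chi_square_statistic(example_table)

# 3.841 is the 5% critical value of the chi-square distribution with 1 degree
# of freedom, which is the right reference for a 2x2 table.
is_significant = stat > 3.841
print(round(stat, 3), is_significant)
```

A statistic above the critical value means the difference between the two groups is unlikely to be due to chance alone at the 5% level.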
Let’s go!
Marital status:
On the graph below we can see that single people give positive responses more often than married ones. The statistical test supports this relation.
Education and job position:
The graph shows a slight increase in the share of positive responses as the educational level rises. The test confirms a relation: clients with higher (tertiary) education tend to give positive responses, while those with primary or secondary education tend to give negative responses to the marketing campaign.
As for employment and job position, unemployed clients tend to give positive responses to the marketing campaign, while employed clients, in contrast, tend to give negative ones. Note that the ‘unemployed’ factor includes students and retired clients. This relation was also supported by the test. Among employed clients, a positive relation can be found only for managerial and administrative positions. The others show either no relation or a negative one.
Age and balance:
It is not quite evident from the graphs, but there are some differences in average age between the two groups. Clients who have subscribed to a term deposit tend to be older and to have a higher account balance.
I’d also like to see whether there is any relation between a client having other loans or credit in default and their response to the marketing campaign. The result is the following: if a client already has a loan or credit, they will probably decline to subscribe to a term deposit. That’s quite logical. This relation is noticeable on the graphs and is supported by the tests (at least for housing and personal loans).
I also want to look separately at whether there is any relation between the results of the previous campaign and the present one. On the graph below we can clearly see the prevalence of positive responses among the successful cases of the previous campaign. The same relation is supported by the test.
That’s what I was expecting: if a client agrees to something in one marketing campaign, there is a good chance that they will do so again in the next campaign. Perhaps these are loyal clients or something of that kind.
Now, I would like to look at the relation with the bank’s own actions. For that I will use:
This part will help us understand how you can influence the client’s decision, or at least whether your actions are related to it at all.
From the graph with months we can clearly see the prevalence of positive responses among the clients who were last contacted in September, October, March and December. The results of the statistical test support this positive relation, adding April and February to the set. The other months, in contrast, show a negative relation.
As for the day, we can see an increase in positive responses among people who were last contacted at the beginning of the month (1st, 3rd, 4th, 10th days) and a noticeable increase on the 30th. Where the graph is not clear, we can look at the results of the test, which shows some significant relations: contacting the client at the beginning of the month is positively related to a positive response to the marketing campaign. The 30th day is also positively related to the client’s subscription to the term deposit.
The duration of the last contact differs between the groups, which is evident from the graph: positive responses to the marketing campaign come with longer contact durations. Maybe that is more or less obvious and self-explanatory (a person can give a quick negative response, but a positive response takes more time… but that’s only my hypothesis). The significant difference between the means of the two groups (positive and negative responses) is also supported by the test.
( * - the asterisk mark on the y-axis means that the scale is converted into log10 for normalization and better visualization)
As for the number of contacts with clients, we have fairly similar results in both cases: before and after the campaign. No big difference can be seen from the graphs (however, the t-test says that the difference in means exists and is significant). Anyway, it is more interesting to look at the third graph, with the number of days that passed after the last contact. The difference between the means can be seen clearly from the graph, as can the overall position of the boxes. The test likewise supports a significant difference between the means of the two groups (those who gave a positive or a negative response to the marketing campaign). So we may say that, in our case, the more recent the last contact, the higher the chance that the client subscribes to the term deposit. Which is quite logical: if we forget about our client and don’t contact them, they will forget about our service too and may even churn. So don’t forget your client and they will give positive responses to the marketing campaign. But, again, it’s only a hypothesis :)
Now, when exploratory data analysis is done and we’ve accumulated some knowledge about the variables we are going to use in our predictive models and posterior analysis, we can proceed to the model building.
Our outcome is response: we are interested in which factors can predict whether the client will give a positive or a negative response to our marketing campaign. Out of personal interest I decided to try logistic regression here. Later, my colleagues will pick up and continue this story with other methods (decision tree and Bayesian network models in particular). I will not dwell on this part, as the later analysis will be more elaborate. So, I’ll try to be short!
First, we divide the sample into two parts, train and test, in the proportion 80/20. Then we build a model. As predictors I used all the variables that we explored previously.
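A minimal sketch of such an 80/20 split, using only the Python standard library. The function name, the fixed seed and the use of plain row indices are assumptions made for illustration; the original analysis may have split the data differently (for example with stratification or different rounding).

```python
import random


def train_test_split(rows, test_share=0.2, seed=42):
    """Shuffle the rows reproducibly and cut off `test_share` of them as test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


# 40841 is the number of clients implied by the count tables in this report;
# row indices stand in for the actual client records.
client_ids = list(range(40841))
train, test = train_test_split(client_ids)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so the model and its evaluation can be rerun on exactly the same rows.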
From the results of the logistic regression I can say which predictors have a positive relation (meaning that with them the client is predicted to give a positive response) and which have a negative one. So, the predictors with a positive relation are:
And negative relation:
Here I don’t want to go into detail about each predictor, but I really want to highlight the strong positive relation of the previous campaign’s outcome. According to the logistic regression, a positive response in the previous campaign multiplies the odds of a positive response in the present one by a factor of 9.25, an increase of 825% compared to someone who gave a negative response in the previous campaign.
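The arithmetic behind the 825% figure can be sketched as follows. The odds ratio of 9.25 is the value reported above; the 10% baseline probability is a HYPOTHETICAL number chosen only to show that an odds ratio multiplies odds, not probabilities.

```python
import math

odds_ratio = 9.25  # reported above for a successful previous campaign

# Percentage change in the odds: (OR - 1) * 100.
percent_increase = (odds_ratio - 1) * 100

# The corresponding logistic regression coefficient is the log of the odds ratio.
log_odds_coefficient = math.log(odds_ratio)


def apply_odds_ratio(probability, ratio):
    """Convert a probability to odds, scale the odds, and convert back."""
    odds = probability / (1 - probability)
    new_odds = odds * ratio
    return new_odds / (1 + new_odds)


baseline_probability = 0.10  # HYPOTHETICAL, for illustration only
new_probability = apply_odds_ratio(baseline_probability, odds_ratio)
print(percent_increase, round(log_odds_coefficient, 3), round(new_probability, 3))
```

Note that the probability does not grow by a factor of 9.25; only the odds do.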
I also want to mention that the results we see here support many of the hypotheses from the exploratory part above.
    obs
pred    0    1
   0 7065  608
   1  176  320
attr(,"class")
[1] "confusion.matrix"
[1] 0.9040274
Talking about the model’s performance: it looks pretty good! 90% accuracy is a strong result, and the confusion matrix shows that 7065 people were correctly predicted as giving a negative response to the present campaign and 320 people were correctly predicted as giving a positive response. Many people (608) were also incorrectly predicted as giving a negative response, which can be explained by the strong prevalence of negative responses in the whole sample.
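As a quick check, the accuracy can be recomputed from the confusion matrix, and compared with what a trivial model that always predicts a negative response would achieve. This is a minimal sketch; the dictionary layout and variable names are our own.

```python
# Logistic regression confusion matrix from the report.
# Keys are (predicted, observed); 0 = negative response, 1 = positive response.
confusion = {(0, 0): 7065, (0, 1): 608, (1, 0): 176, (1, 1): 320}

total = sum(confusion.values())

# Share of all test clients whose response was predicted correctly.
accuracy = (confusion[(0, 0)] + confusion[(1, 1)]) / total

# Accuracy of always predicting "negative": the share of observed negatives.
observed_negative = confusion[(0, 0)] + confusion[(1, 0)]
baseline_accuracy = observed_negative / total

# Share of actual positive responses that the model finds.
observed_positive = confusion[(0, 1)] + confusion[(1, 1)]
recall_positive = confusion[(1, 1)] / observed_positive

print(round(accuracy, 4), round(baseline_accuracy, 4), round(recall_positive, 4))
```

Because negative responses dominate the sample, the always-negative baseline is already high, so accuracy alone overstates how well positive responses are predicted.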
As the client requested, we need to build two types of predictive models, a decision tree and a Bayesian network. Let us start with the tree and then proceed to the network. A decision tree is a machine learning algorithm that, put simply, finds the combination of variable splits that leads most directly to an outcome, the target variable. Trees can be either regression trees, which predict a number, or classification trees, which predict a category. In our case, the target we want to predict has two categories, a ‘yes’ or ‘no’ answer, so we use a classification tree. We put all of the variables in our dataset into the model to predict whether a client subscribes to a term deposit (a ‘yes’ answer) or not (a ‘no’ answer). However, one of the variables, the last contact day of the month, is purposely left out, as the tree algorithm has some difficulty dealing with the 31 categories of this variable. Everything else is in order, so let’s build the model!
Let’s take a look at the picture. It appears that the algorithm considers only two factors to have some influence on our target variable: last contact duration and a successful outcome of the previous marketing campaign. Also, according to the tree’s structure, a successful outcome of the previous campaign influences whether a client subscribes to a term deposit through the duration of the last contact. For us, as experienced analysts, it seems a bit odd that the tree displayed only two variables as significant for the prediction. We decided to get the overall importance of all the variables that the algorithm used in its calculations. Here it is.
On the lollipop chart you can see all of the variables we gave to the model as input, ordered by their overall importance in predicting the outcome. It seems that more than two factors can contribute to the model’s accurate prediction. Besides those already mentioned, they include the number of days that passed after the client was last contacted in a previous campaign, the number of contacts performed for this client before this campaign, whether the client has a housing loan, the last contact month of the year, and the number of contacts performed for this client during this campaign. Their lines end with blue caps, meaning that the importance metric is different from zero.
But that’s all about the model’s structure. Now let’s turn to the model’s performance.
Confusion Matrix and Statistics

bank_pred3   no  yes
       no  7035  607
       yes  206  321

               Accuracy : 0.9005
                 95% CI : (0.8938, 0.9069)
    No Information Rate : 0.8864
    P-Value [Acc > NIR] : 2.419e-05

                  Kappa : 0.3911

 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.9716
            Specificity : 0.3459
         Pos Pred Value : 0.9206
         Neg Pred Value : 0.6091
             Prevalence : 0.8864
         Detection Rate : 0.8612
   Detection Prevalence : 0.9355
      Balanced Accuracy : 0.6587

       'Positive' Class : no
According to the statistics above, the model performs quite well: it correctly predicts 90% of the part of the bank marketing dataset that was intentionally hidden from the model to assess the quality of its predictions. However, if we look a bit closer, we realize that this quality mostly concerns the ‘no’ answers of the target variable. The ‘yes’ answers are predicted much less reliably: only about 61% of the model’s ‘yes’ predictions are correct (Neg Pred Value), and it finds only about 35% of the clients who actually said ‘yes’ (Specificity). This happened because we didn’t have enough instances of ‘yes’ answers in the dataset. A nearly identical result is observed in the assessment of the logistic regression model’s performance. For us this serves as a validation of the consistency of the machine learning models.
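The per-class statistics in the output above can be reproduced from the four cells of the tree’s confusion matrix. Below is a minimal sketch with ‘no’ as the positive class, as in the output; the nested-dictionary layout and the function name are our own choices for illustration.

```python
# Decision tree confusion matrix from the report: tree[predicted][actual].
tree = {"no": {"no": 7035, "yes": 607}, "yes": {"no": 206, "yes": 321}}


def class_metrics(matrix, positive, negative):
    """Per-class metrics, treating `positive` as the class of interest."""
    tp = matrix[positive][positive]  # predicted positive, actually positive
    fp = matrix[positive][negative]  # predicted positive, actually negative
    fn = matrix[negative][positive]  # predicted negative, actually positive
    tn = matrix[negative][negative]  # predicted negative, actually negative
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


metrics = class_metrics(tree, positive="no", negative="yes")
for name, value in metrics.items():
    print(f"{name:<18}{value:.4f}")
```

With ‘no’ as the positive class, the specificity is the share of actual ‘yes’ answers that the tree finds, and the negative predictive value is the share of its ‘yes’ predictions that are correct.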
To conclude this part, the hypotheses according to the decision tree model are the following. Whether a client subscribes to a term deposit is primarily associated with a successful outcome of the previous campaign and the duration of the last contact, and secondarily with the number of days that passed after the client was last contacted in a previous campaign, the number of contacts performed for this client before this campaign, whether the client has a housing loan, the last contact month of the year, and the number of contacts performed for this client during this campaign.
Now we can begin with a Bayesian network model. First of all, let’s try to build this model including all the variables we have in the dataset. As we assume no prior knowledge of how the variables are linked together and related to the outcome, we rely on the algorithm to define the structure of the network for us.
In the resulting visualization it becomes clear that some links are wrongly directed. The link from the outcome variable, response, certainly is, as response cannot influence the duration variable: the duration had been recorded before the result of the subscription to a term deposit was known. The impact of job on education suggested by this structure is also suspicious; it should probably go in the opposite direction. Besides, the relation between the last contact month of the year (month) and the last contact day of the month (day) should probably be reversed as well. Going further, it seems logical to me that whether a client has credit in default (default) influences the client’s balance and whether (s-)he has a personal loan, not vice versa. Finally, age cannot be influenced by anything. Let’s try to fix those links.
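One way to think about such fixes is as a list of forbidden arc directions that a learned structure is checked against. The sketch below is only an illustration in plain Python, not the procedure used to build the network: the arc list `learned` is HYPOTHETICAL, `loan` is an assumed name for the personal-loan variable, and the month/day pair is left out because the text does not say which direction was learned.

```python
# Arcs are (parent, child) pairs. These forbidden directions restate the text.
# "loan" as the name of the personal-loan variable is an assumption.
FORBIDDEN_ARCS = {
    ("response", "duration"),  # the outcome cannot influence call duration
    ("job", "education"),      # education should influence job, not vice versa
    ("balance", "default"),    # default should influence balance
    ("loan", "default"),       # default should influence personal loan
}
NO_PARENTS = {"age"}           # nothing may point into these nodes


def find_violations(arcs):
    """Return the arcs that break a forbidden direction or enter a no-parent node."""
    return [
        (parent, child)
        for parent, child in arcs
        if (parent, child) in FORBIDDEN_ARCS or child in NO_PARENTS
    ]


# HYPOTHETICAL learned structure, for illustration only.
learned = [
    ("response", "duration"),
    ("education", "age"),
    ("poutcome", "response"),
]
violations = find_violations(learned)
print(violations)
```

Structure-learning tools typically accept such forbidden arcs as a blacklist, so that the algorithm never proposes them in the first place.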
The relationships that were problematic before are now fine. But there are other issues with the relations between variables. The month of the last contact influences education, which seems wrong. Age is still affected by the client’s education and job.
After these unsuccessful attempts at building a reasonable network, we found it justified to use the hypotheses that resulted from the decision tree. So we included only those variables from the dataset whose overall importance, as detected by the decision tree, was different from zero. They are coded as poutcome, duration, pdays, previous, age, housing, month, and campaign.
Well, the same problem with the response-duration relation occurs again. Luckily, it is the only concerning one in this structure. Let’s try to fix it once more by reversing the relationship.
Okay, here we go. Now it looks much better. Let us try to describe it in detail. According to the resulting structure:
age of the client influences whether a client has a housing loan and the last contact month. Although the first one mentioned is reasonably placed, inferring that the younger a client is the more likely (s-)he is to have housing loan, considering age from 18 to 25-27 (due to money shortage, we assume), the latter one seems a bit weird.
whether a client has a housing loan or not (housing) influences the outcome variable, which can be explained as follows: if a client has a housing loan, (s-)he probably doesn’t have spare money to put on deposit, as any extra income goes to loan payments. The relation of housing to the last contact month (month) is doubtful. Housing’s influence on the outcome of the previous campaign (poutcome), in turn, can be understood as the bank not having considered holders of housing loans in its previous campaign, which is why the latter might have been unsuccessful.
month of the last contact (month) affects success of the previous campaign (poutcome), number of contacts performed during this campaign and for this client (campaign), and number of days that passed by after the client was last contacted from a previous campaign (pdays). The latter has the very obvious temporal causal relation, we think. Relation of month to campaign and pdays can be attributed to the peculiarities of the previous campaign.
success of the previous marketing campaign (poutcome) has an impact on the outcome, pdays, and the number of contacts performed before this campaign and for this client (previous). The relation of poutcome to a subscription to a term deposit (response) is very reasonable and positively directed, meaning that if poutcome is “successful”, the outcome will probably be “yes”. Note that this strong relation was likewise observed in the decision tree model. The relation poutcome-pdays, however, doesn’t seem logical.
number of contacts performed during this campaign and for this client (campaign) is one of the terminal nodes in the network, having supposedly no relation to the outcome variable.
last contact duration (duration) has a direct influence on the outcome, displaying the pattern similar to the results of the decision tree model.
the number of contacts performed before this campaign and for this client (previous) is also a terminal node meaning it has no impact on whether a client subscribes to a term deposit or not.
number of days that passed by after the client was last contacted from a previous campaign (pdays) is endogenous and terminal. According to the structure of this Bayesian network, it has no influence on the outcome variable.
In general, the model is logically valid. Besides, compared to the tree model, in the network we can also see that duration directly influences the outcome. Yet, whereas in the tree the duration can affect the outcome through the result of the bank’s previous campaign, there is no such pattern in the network.
I personally think that both models deserve to be included in an analytical report of this kind and cannot be strictly compared. They show their power for business analysis in somewhat different areas. The decision tree turned out to be more suitable for prediction, and can actually be used to predict, based on new data with the same features, whether a client subscribes to a term deposit or not. The Bayesian network, on the other hand, is quite helpful when a more detailed analysis begins and several combinations of variable interactions need to be tried, something that cannot be adjusted or observed in a decision tree. More precisely, a Bayesian network model lays the foundation for further what-if analysis. And that’s what is coming next.
We have built the Bayesian network, and now we calculate the probability of subscribing to a term deposit given specific characteristics of a client. The overall probability of subscribing for any client in the database is around 12%. The highest probability of subscribing is for clients whose previous campaign was successful (65%).
                         Term_deposit
Previous_campaign_result        no        yes
                 failure 0.8748473 0.12515268
                 success 0.3541194 0.64588062
                 unknown 0.9093514 0.09064863
The probability of subscribing is also high for those whose last contact lasted for roughly 20-40 seconds (63%).
               Term_deposit
duration               no       yes
  [0.0181,20.6] 0.8922868 0.1077132
  (20.6,41]     0.3739837 0.6260163
  (41,61.5]     0.4242188 0.5757812
  (61.5,82.1]   0.7332705 0.2667295
All other factors seem to be less significant. If a client was contacted in December, the probability of subscribing is 29%.
      Term_deposit
month        no        yes
  apr 0.8846295 0.11537054
  aug 0.8668248 0.13317518
  dec 0.7140620 0.28593803
  feb 0.8597739 0.14022614
  jan 0.8595534 0.14044659
  jul 0.8973635 0.10263649
  jun 0.8879112 0.11208884
  mar 0.8048301 0.19516987
  may 0.9159796 0.08402041
  nov 0.8794731 0.12052687
  oct 0.7662570 0.23374304
  sep 0.7276266 0.27237343
If a client had not been contacted for 435-653 days, the probability of subscribing is similar, at about 28%.
                       Term_deposit
Days_since_last_contact        no       yes
            [-1.87,217] 0.8896056 0.1103944
            (217,435]   0.8481003 0.1518997
            (435,653]   0.7151452 0.2848548
            (653,872]   0.8205680 0.1794320
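To compare these probabilities on one scale, each can be divided by the overall subscription probability, which gives a simple “lift”. The sketch below uses the conditional probabilities from three of the tables above and the overall rate of around 12% stated in the text; since that baseline is rounded, the lifts are approximate.

```python
BASELINE = 0.12  # overall probability of subscribing, "around 12%" in the text

# P(yes | category), copied from the tables above.
p_yes = {
    "previous campaign result": {
        "failure": 0.12515268, "success": 0.64588062, "unknown": 0.09064863,
    },
    "last contact duration": {
        "[0.0181,20.6]": 0.1077132, "(20.6,41]": 0.6260163,
        "(41,61.5]": 0.5757812, "(61.5,82.1]": 0.2667295,
    },
    "days since last contact": {
        "[-1.87,217]": 0.1103944, "(217,435]": 0.1518997,
        "(435,653]": 0.2848548, "(653,872]": 0.1794320,
    },
}


def best_category(table):
    """Return the (category, probability) pair with the highest P(yes)."""
    return max(table.items(), key=lambda item: item[1])


summary = {}
for factor, table in p_yes.items():
    category, probability = best_category(table)
    summary[factor] = (category, probability, probability / BASELINE)
    print(f"{factor}: {category} -> {probability:.1%}, lift {probability / BASELINE:.1f}x")
```

A lift well above 1 marks a group of clients that subscribes much more often than the average client.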
Having all these probabilities in mind, I want to suggest some tips for improvement:
And what will happen if these strategies are implemented?
However, the proportion of people who had a successful previous campaign is very low. And here are the things that can be done for the other categories:
Subscription rate before policy (in %)
Successful result 65
Failed result 13
Unknown result 9
Subscription rate after policy (in %)
Successful result 100
Failed result 49
Unknown result 62
So, our team performed an extensive analysis for you. We have shown the current situation at the bank, explored the possible reasons for it, and suggested improvement policies. The main advice is to regulate the calls, keeping them fairly short and not too frequent. We hope that our work will be useful in improving your promotional campaigns. Thank you for your attention and all the best!