STEP 1. Analyzing the data

From the bar plot taken from the EDA (Anya’s part) we can see that the number of clients who did not subscribe to a term deposit (36,202 people) is much larger than the number of those who did (4,639 people). This means that my predictions for subscribers might not be precise enough, because the sample contains very few observations of subscribers.
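
As a quick check of how unbalanced the sample is, the two counts can be turned into shares. This is a minimal sketch in R; it takes 36,202 as the non-subscribers and 4,639 as the subscribers, which is the assignment consistent with the confusion matrix in Step 2 (35,346 + 856 observed non-subscribers). The variable names are illustrative.

```r
# Class counts quoted in the text (assumed: "no" = did not subscribe)
counts <- c(no = 36202, yes = 4639)

# Share of each class in the whole sample
shares <- counts / sum(counts)
round(shares, 3)

# How many non-subscribers there are for every subscriber
imbalance_ratio <- counts[["no"]] / counts[["yes"]]
round(imbalance_ratio, 1)
```

Roughly one client in nine subscribed, so any model can look accurate simply by predicting ‘no’ most of the time.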

STEP 2. Building predictive model

Logistic Regression Model

First, I decided to build a logistic regression model. Initially, I put all the variables except response_binary into the formula to see which variables have a significant effect on response. It turned out that the variables age, default, day, and pdays do not have a significant effect on a customer’s response. I decided to refer to the EDA again to understand the connections between those variables and response.

From the boxplot below we can see that there is almost no difference in age between clients who subscribed to our term deposit and those who did not. In both groups the mean age is about 37-38 years.

The barplot below compares the share of subscribers among clients who have credit in default and among those who do not. About 12% of the clients without credit in default subscribed, compared with about 6% of those with credit in default. The share halves, and I personally think that this variable could influence the response. For example, a person who has credit in default is less likely to subscribe to our term deposit because he/she has to repay this credit and is not interested in any new offers. The plot does show this relationship, but it is not strong enough for the model to treat the variable as significant.

The boxplot below shows that there is no difference between clients who subscribed to our term deposit and those who did not in the day of the month on which they were last contacted. For both groups the mean day is 15, which can be explained by the fact that contact between the bank and a client can happen on any day of the month. As each month has 30-31 days, the mean day is the 15th. It seems that this variable really cannot predict the response.

The last boxplot illustrates the number of days that passed since a client was last contacted in a previous campaign, for both subscribers and non-subscribers. There is a small difference of about 30-40 days, but I cannot find a logical reason why this variable could be connected to the responses of our clients. In addition, there are many outliers on the plot, because most of the clients were last contacted no more than 300 days ago.

Thus, I built a logistic regression model based on all the variables except the four already mentioned.

## 
## Call:
## glm(formula = response ~ job + marital + education + duration + 
##     poutcome + balance + housing + loan + month + campaign + 
##     previous, family = binomial, data = bank)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.7772  -0.3653  -0.2538  -0.1768   3.1691  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.275e+00  1.331e-01 -17.090  < 2e-16 ***
## jobblue-collar     -3.294e-01  7.647e-02  -4.308 1.64e-05 ***
## jobentrepreneur    -4.038e-01  1.324e-01  -3.051 0.002284 ** 
## jobhousemaid       -5.203e-01  1.426e-01  -3.649 0.000263 ***
## jobmanagement      -2.241e-01  7.792e-02  -2.876 0.004026 ** 
## jobother           -7.617e-01  3.338e-01  -2.282 0.022485 *  
## jobretired          2.514e-01  9.224e-02   2.726 0.006410 ** 
## jobself-employed   -3.968e-01  1.201e-01  -3.304 0.000953 ***
## jobservices        -2.317e-01  8.811e-02  -2.629 0.008555 ** 
## jobstudent          4.692e-01  1.217e-01   3.855 0.000116 ***
## jobtechnician      -1.914e-01  7.251e-02  -2.640 0.008300 ** 
## jobunemployed      -1.692e-01  1.165e-01  -1.453 0.146241    
## maritalmarried     -2.034e-01  6.172e-02  -3.296 0.000982 ***
## maritalsingle       1.262e-01  6.623e-02   1.906 0.056696 .  
## educationsecondary  2.124e-01  6.673e-02   3.183 0.001456 ** 
## educationtertiary   4.601e-01  7.792e-02   5.905 3.52e-09 ***
## duration            2.518e-01  4.051e-03  62.157  < 2e-16 ***
## poutcomesuccess     2.286e+00  8.336e-02  27.426  < 2e-16 ***
## poutcomeunknown    -2.691e-01  7.119e-02  -3.780 0.000157 ***
## balance             3.693e-05  1.054e-05   3.505 0.000456 ***
## housingyes         -7.434e-01  4.592e-02 -16.188  < 2e-16 ***
## loanyes            -4.200e-01  6.201e-02  -6.773 1.26e-11 ***
## monthaug           -7.769e-01  8.366e-02  -9.287  < 2e-16 ***
## monthdec            5.447e-01  1.962e-01   2.776 0.005509 ** 
## monthfeb           -2.513e-01  9.146e-02  -2.748 0.005997 ** 
## monthjan           -1.220e+00  1.313e-01  -9.298  < 2e-16 ***
## monthjul           -8.953e-01  8.321e-02 -10.760  < 2e-16 ***
## monthjun           -6.811e-01  8.469e-02  -8.042 8.81e-16 ***
## monthmar            1.655e+00  1.297e-01  12.761  < 2e-16 ***
## monthmay           -1.042e+00  7.416e-02 -14.055  < 2e-16 ***
## monthnov           -8.843e-01  9.075e-02  -9.744  < 2e-16 ***
## monthoct            7.976e-01  1.183e-01   6.743 1.55e-11 ***
## monthsep            7.151e-01  1.323e-01   5.404 6.50e-08 ***
## campaign           -8.833e-02  1.056e-02  -8.368  < 2e-16 ***
## previous            3.126e-02  1.292e-02   2.420 0.015518 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28911  on 40840  degrees of freedom
## Residual deviance: 19297  on 40806  degrees of freedom
## AIC: 19367
## 
## Number of Fisher Scoring iterations: 6

From the summary we can see that almost all variables have a significant effect. stepAIC did not drop any of them, because the only terms that are not significant at the 5% level (jobunemployed and maritalsingle) are single levels of the factors job and marital. I will not go into the details of this model, because I needed it only to exclude unnecessary variables, but I will evaluate it to decide whether it is reliable enough.
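
The estimates in the summary are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to read. The sketch below simply copies four estimates from the printout; the vector names are illustrative, and with the fitted object at hand (here assumed to be called model) exp(coef(model)) would give the same values for all terms.

```r
# Estimates copied from the glm summary above (log-odds scale)
log_odds <- c(duration        =  0.2518,
              poutcomesuccess =  2.286,
              housingyes      = -0.7434,
              loanyes         = -0.4200)

# Odds ratio = exp(estimate); values above 1 raise the odds of subscribing,
# values below 1 lower them
odds_ratios <- exp(log_odds)
round(odds_ratios, 2)
```

For example, the odds of subscribing after a successful previous campaign are roughly ten times the odds in the reference category of poutcome, holding the other variables fixed.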

##     obs
## pred     0    1
##    0 35346 3021
##    1   856 1618
## attr(,"class")
## [1] "confusion.matrix"

From the confusion matrix we can conclude that 35,346 customers were correctly classified as non-subscribers and 1,618 customers were correctly classified as subscribers. The others were misclassified by our model: 3,021 customers subscribed to our term deposit although we predicted that they would not, and 856 customers did not subscribe although they were predicted to do so.

## [1] 0.9050709

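The accuracy of our model is 90.5%, which looks very high. However, because 36,202 of the 40,841 clients (88.6%) are non-subscribers, a model that always predicts ‘no’ would already reach 88.6% accuracy, and our model finds only 1,618 of the 4,639 actual subscribers (about 35%). Overall, the model is quite accurate, but it is far from perfect.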
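
Accuracy alone can flatter a model on unbalanced data, so it helps to compare it with the share of the majority class and with the share of actual subscribers that the model finds. A minimal sketch that rebuilds the confusion matrix above by hand (object names are illustrative):

```r
# Confusion matrix from the output above: rows = predicted, columns = observed
cm <- matrix(c(35346, 856, 3021, 1618), nrow = 2,
             dimnames = list(pred = c("0", "1"), obs = c("0", "1")))

# Share of all clients classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Accuracy of a model that always predicts the majority class
baseline <- max(colSums(cm)) / sum(cm)

# Share of actual subscribers (obs = 1) that the model identifies
recall_yes <- cm["1", "1"] / sum(cm[, "1"])

round(c(accuracy = accuracy, baseline = baseline, recall_yes = recall_yes), 4)
```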

Decision Tree 1

Then I started to build a decision tree. Initially, I again decided just to put all the variables into the formula and look at the result.

After my first try, the tree was built on only two variables: duration and poutcome. Let’s first understand what is happening in the decision tree itself and then decide whether it is representative.

Interpretation

Each node shows the probability of signing; a node predicts ‘no’ whenever that probability is below 50%. The root node (the highest one) indicates that the overall probability of a customer subscribing to our term deposit is 0.11, or 11%. This node asks whether the last call with a client was shorter than 8.2 minutes. If yes, the probability of subscribing drops to 7%; 88% of our customers had a last call shorter than 8.2 minutes.

For these customers, the next node asks about the outcome of the previous marketing campaign. 85% of all clients responded ‘no’ to the previous campaign or their response is unknown, and their probability of signing a term deposit is 5%. Those who answered ‘yes’ to the previous campaign (3% of all customers) have a probability of signing of 61%. Among these clients, those whose last call lasted less than 2.7 minutes have a probability of signing of 31%, while the others have a probability of signing of 74%.

For the customers whose last call was longer than 8.2 minutes (12% of all clients), the probability of signing a term deposit is 43%. The next node again asks about the duration of the call: 8% of the customers had a call shorter than 14 minutes and 4% had a longer one. The former have a probability of signing of 35% and the latter a probability of signing of 59%. Within the 8% group, the clients who responded ‘no’ to the previous campaign or whose response is unknown have a probability of signing of 32%.
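
One way to check how the node labels fit together: the probability in a parent node should equal the share-weighted average of the probabilities in its children. The rough check below uses the rounded percentages quoted above and reads each node value as the probability of subscribing, which is the reading that makes the root’s 0.11 agree with the 11% share of subscribers in the data. The names are illustrative.

```r
# Children of the root: share of all customers and probability of subscribing
share <- c(short_call = 0.88, long_call = 0.12)
p_yes <- c(short_call = 0.07, long_call = 0.43)
root_p <- sum(share * p_yes) / sum(share)

# Same check one level down, inside the short-call branch
share_short <- c(prev_fail_or_unknown = 0.85, prev_success = 0.03)
p_short     <- c(prev_fail_or_unknown = 0.05, prev_success = 0.61)
short_p <- sum(share_short * p_short) / sum(share_short)

# Both values should be close to the parent labels (0.11 and 0.07),
# up to the rounding of the inputs
round(c(root = root_p, short_call = short_p), 3)
```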

Inferences

First, it was obvious to me that if a client agreed in the previous marketing campaign, he/she is more likely to agree in the new one too, and the data supports this proposition. From the barplot below we can see that more than 60% of the clients who answered ‘yes’ to the previous campaign also responded ‘yes’ to our term deposit. Among those who declined last time, the share of positive responses is only about 12%. Even though the model considered this variable a significant influence, I tried deleting it from the formula to see whether there are other variables that are less significant but can still predict whether a client will subscribe to a term deposit. The results are presented in the section “Decision Tree 2”.

Second, the duration variable turned out to be problematic. The boxplot below shows the last contact duration for the clients who subscribed to a term deposit and for those who did not. Unfortunately, we cannot know the type of these phone calls. I suppose they were personal calls from a bank employee to a client, made to convince him/her to sign our term deposit. Thus, I think that the longer the call, the higher the chance that the employee succeeded, and the boxplot does show this pattern. However, the description of the dataset states that even though duration strongly affects the output, it should not be included in a predictive model that is meant to be realistic. The reasons are that 1) the duration of a call is not known before the call itself, so we cannot use it to predict, and 2) after the call the response is obviously known (source: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#). So, I also tried to build a decision tree without this variable, but it became very small. Without duration, the decision tree relies only on poutcome and is not very useful.

An interesting fact is that if we delete both poutcome and duration from our model, the decision tree cannot be built at all. So, at least one of those variables should be present in the formula.

Evaluation

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7072  610
##        yes  181  305
##                                           
##                Accuracy : 0.9032          
##                  95% CI : (0.8965, 0.9095)
##     No Information Rate : 0.888           
##     P-Value [Acc > NIR] : 4.967e-06       
##                                           
##                   Kappa : 0.3878          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9750          
##             Specificity : 0.3333          
##          Pos Pred Value : 0.9206          
##          Neg Pred Value : 0.6276          
##              Prevalence : 0.8880          
##          Detection Rate : 0.8658          
##    Detection Prevalence : 0.9405          
##       Balanced Accuracy : 0.6542          
##                                           
##        'Positive' Class : no              
## 

From the confusion matrix, we can see that the proportions of true and false negatives and positives stayed about the same, but the numbers became lower because of the smaller sample. The number of clients who were predicted to sign our term deposit and did so (305 people) is still larger than the number of those who were predicted to sign but did not (181). The number of clients who were correctly predicted to be non-subscribers (7,072) is still larger than the number of those who were incorrectly predicted to be non-subscribers (610). The accuracy is again about 90% (0.9032), only slightly above the no-information rate of 0.888. Note that the ‘positive’ class in this output is ‘no’, so the specificity of 0.3333 means that the tree identifies only about a third of the actual subscribers.
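
The statistics in this printout can be recomputed directly from the four cell counts, which makes it clearer what each one measures. The sketch below is a hand calculation in base R rather than the call that produced the output; object names are illustrative.

```r
# Decision Tree 1 confusion matrix: rows = prediction, columns = reference
cm <- matrix(c(7072, 181, 610, 305), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n
# The positive class is "no", so sensitivity is about non-subscribers
sensitivity <- cm["no", "no"] / sum(cm[, "no"])
# Specificity is the share of actual subscribers predicted as "yes"
specificity <- cm["yes", "yes"] / sum(cm[, "yes"])

# Cohen's kappa: agreement beyond what the row and column totals alone give
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = cohen_kappa), 4)
```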

Opportunity costs + monetary gains/losses

On the ROC curve, the grey line represents our classifier on the test set and the dashed line represents a classifier with no predictive value. In plain words, to understand how good our predictive model is we should measure the area under the grey curve (the AUC). Ideally, it should be close to 1; in the worst case it will be about 0.5, which is the area under the dashed line (source: http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/13_ModelEvaluation.html).

## [[1]]
## [1] 0.7457228

I got 0.75, which is acceptable: the model ranks subscribers above non-subscribers noticeably better than chance.
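
The value reported here is the area under the ROC curve (AUC). One way to read it: it is the probability that a randomly chosen subscriber receives a higher predicted score than a randomly chosen non-subscriber. The sketch below computes it from ranks on a small hypothetical set of labels and scores (not the bank data); auc_rank is an illustrative helper, not the function used for the result above.

```r
# Hypothetical toy data: 1 = subscribed, 0 = did not subscribe;
# scores stand in for predicted probabilities of subscribing
labels <- c(0, 0, 1, 0, 1)
scores <- c(0.10, 0.30, 0.35, 0.60, 0.80)

# Rank-based (Mann-Whitney) AUC: share of subscriber/non-subscriber pairs
# in which the subscriber has the higher score
auc_rank <- function(scores, labels) {
  r     <- rank(scores)          # ties receive average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_auc <- auc_rank(scores, labels)
toy_auc
```

In the toy data, five of the six subscriber/non-subscriber pairs are ordered correctly, so the AUC is 5/6.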

Decision Tree 2

Now I will show what we get if we decide to throw poutcome away.

Because I deleted one of the most significant variables, other, less significant ones appeared in our decision tree.

Interpretation

Each node again shows the probability of signing. The root node (the highest one) still indicates that the overall probability of a customer subscribing to our term deposit is 0.11, or 11%. This node asks whether the last call with a client was shorter than 8.2 minutes. If yes, the probability of subscribing drops to 7%; 88% of our customers had a last call shorter than 8.2 minutes.

For these customers, the next node asks about the month in which the last contact took place. If it was March, September, October, or December, the chance that a customer will sign our term deposit rises to 45%. Only 3% of all clients fall into this group. Among them, people whose last call lasted more than 2.2 minutes have a probability of signing of 58%; for the others the probability of signing is 16%. Those whose last contact was in another month (84% of all customers) have a probability of signing of 5%.

For the customers whose last call was longer than 8.2 minutes (12% of all clients), the probability of signing a term deposit is 43%. The next node again asks about the duration of the call: 4% of the customers had a call longer than 14 minutes and 8% had a shorter one. The former have a probability of signing of 59% and the latter a probability of signing of 35%. Among the latter, those who had never been contacted before this campaign (7% of all customers) have a probability of signing of 32%, while those who had been contacted at least once (only 1% of customers) have a probability of signing of 54%. The last split in this branch is on the housing loan: the clients in this 1% group who have a housing loan have a probability of signing of 40%.

Inferences

I will go from the top of the tree to the bottom. From the barplot below we can see that four months were much more successful than the others: March, September, October, and December, when about 50% of the clients contacted in each of these months signed our term deposit. December may be popular for deposits because of the Christmas holidays, when people spend and buy a lot. September and October are the months when students start to study (in Portugal the academic year begins in the middle of September), and they are probably the reason for such a high demand for deposits in these months. Besides, there are a lot of carnivals in Portugal, and Easter falls in April (just after March), which could also make people sign term deposits.

The following histogram illustrates the number of contacts performed for each client before our campaign. We can observe that most of our clients had never been contacted before, and almost all of these clients did not sign our term deposit.

For me, the logic for the housing loan is the same as for credit in default: if a client already has a loan, he/she is less likely to sign a term deposit because he/she has to repay the loan. The barplot below shows this relationship. Only about 7% of our clients who have a housing loan responded ‘yes’ to our campaign, while for those who do not have the loan this number rises to almost 20%.

Evaluation

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7029  604
##        yes  224  311
##                                           
##                Accuracy : 0.8986          
##                  95% CI : (0.8919, 0.9051)
##     No Information Rate : 0.888           
##     P-Value [Acc > NIR] : 0.001058        
##                                           
##                   Kappa : 0.3775          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9691          
##             Specificity : 0.3399          
##          Pos Pred Value : 0.9209          
##          Neg Pred Value : 0.5813          
##              Prevalence : 0.8880          
##          Detection Rate : 0.8606          
##    Detection Prevalence : 0.9345          
##       Balanced Accuracy : 0.6545          
##                                           
##        'Positive' Class : no              
## 

The confusion matrix did not change much. The number of clients who were predicted to sign our term deposit and did so became 311 (compared with 305 in Decision Tree 1). The number of those who were predicted to sign but did not became 224 (compared with 181). The number of clients who were correctly predicted to be non-subscribers is now 7,029 (compared with 7,072), and the number of those who were incorrectly predicted to be non-subscribers is 604 (compared with 610). The accuracy is still about 90% (0.8986).
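
Because ‘no’ is the positive class in both printouts, the quality of the ‘yes’ predictions is easier to compare with precision and recall computed for the ‘yes’ class. A small sketch using the cell counts of the two trees (names are illustrative):

```r
# Counts for the "yes" class, taken from the two confusion matrices:
# tp = predicted yes and signed, fp = predicted yes but did not sign,
# fn = predicted no but signed
tree1 <- c(tp = 305, fp = 181, fn = 610)
tree2 <- c(tp = 311, fp = 224, fn = 604)

yes_metrics <- function(x) {
  c(precision = x[["tp"]] / (x[["tp"]] + x[["fp"]]),
    recall    = x[["tp"]] / (x[["tp"]] + x[["fn"]]))
}

comparison <- rbind(tree1 = yes_metrics(tree1), tree2 = yes_metrics(tree2))
round(comparison, 4)
```

Decision Tree 2 finds a few more subscribers but at the cost of more false alarms, so its precision for ‘yes’ is lower.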

Opportunity costs + monetary gains/losses

There is no observable difference in the ROC curve compared with the previous model, so let’s measure the area under the curve.

## [[1]]
## [1] 0.7449187

It is equal to 0.745, practically the same as in the previous model (0.746), so our second model discriminates between subscribers and non-subscribers about as well as the first.

STEP 3. Building Bayesian network

Hypotheses about the Subscription Process

TEXT

Bayesian Network

Now we will test our hypotheses with Bayesian network.

?HOUSING <- MONTH? DELETE ?MONTH <- POUTCOME? DELETE

Our Bayesian network is based entirely on the two predictive models we discussed before. As we can now build a network without ‘duration’, I deleted it. Also, the final network does not contain the variable ‘previous’, because after the model was built it turned out not to be connected to any other variable. The other variables are taken from the decision trees: poutcome, month, housing, and obviously response.

Interpretation

The Bayesian network shows the relationships between these variables. As we can see, all of them influence response. However, both ‘month’ and ‘housing’ influence it not only directly but also indirectly through ‘poutcome’. As I have already explained the relationships between housing and response, month and response, and poutcome and response, I will not focus on them again. Now I will try to explain each of the remaining arrows.

Housing -> Poutcome

Housing -> Loan

TEXT

Housing -> Month

TEXT

Loan -> Month

TEXT

Month -> Balance

TEXT

Month -> Poutcome

TEXT

Inferences

TEXT

Our initial all-sample probability of no-response

## [1] 0.8807
## [1] 0.07659499

The same with exact inference

## $response
## response
##        no       yes 
## 0.8864132 0.1135868
## $response
## response
##        no       yes 
## 0.8754303 0.1245697

If we want to look at initial conditional tables

##          response
## poutcome         no        yes
##   failure 0.8754303 0.12456971
##   success 0.3558591 0.64414091
##   unknown 0.9090857 0.09091432
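
To see how strongly poutcome shifts the subscription probability, each conditional probability of ‘yes’ can be divided by the overall probability of ‘yes’ from the exact inference output. A minimal sketch using the numbers printed above (the vector names are illustrative):

```r
# P(response = yes | poutcome), copied from the conditional table above
p_yes_given <- c(failure = 0.12456971,
                 success = 0.64414091,
                 unknown = 0.09091432)

# Overall P(response = yes), copied from the exact inference output
p_yes_overall <- 0.1135868

# Lift: how many times more (or less) likely a "yes" is in each group
lift <- p_yes_given / p_yes_overall
round(lift, 2)
```

A lift above 1 means that the group subscribes more often than the average client; in this table only a successful previous campaign raises the probability substantially.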

Evaluation

TEXT

STEP 4. Comparing the models

TEXT