Introduction

Hello, world!

As part of the analytical team, my task was to build predictive models and assess their performance. In this report, I explain in detail how I arrived at my conclusions.

First of all, I load the initial dataset. Our team works with the cleaned version of the Portuguese bank marketing data, so I use that version here as well. Let’s take a brief look at its structure.

library(dplyr)
library(readr)
bank <- read_csv("bank_cleaned.csv")
str(bank)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 40841 obs. of  18 variables:
##  $ X1             : num  0 1 2 5 6 7 8 9 10 11 ...
##  $ age            : num  58 44 33 35 28 42 58 43 41 29 ...
##  $ job            : chr  "management" "technician" "entrepreneur" "management" ...
##  $ marital        : chr  "married" "single" "married" "married" ...
##  $ education      : chr  "tertiary" "secondary" "secondary" "tertiary" ...
##  $ default        : chr  "no" "no" "no" "no" ...
##  $ balance        : num  2143 29 2 231 447 ...
##  $ housing        : chr  "yes" "yes" "yes" "yes" ...
##  $ loan           : chr  "no" "no" "yes" "no" ...
##  $ day            : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ month          : chr  "may" "may" "may" "may" ...
##  $ duration       : num  4.35 2.52 1.27 2.32 3.62 6.33 0.83 0.92 3.7 2.28 ...
##  $ campaign       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays          : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome       : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ response       : chr  "no" "no" "no" "no" ...
##  $ response_binary: num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X1 = col_double(),
##   ..   age = col_double(),
##   ..   job = col_character(),
##   ..   marital = col_character(),
##   ..   education = col_character(),
##   ..   default = col_character(),
##   ..   balance = col_double(),
##   ..   housing = col_character(),
##   ..   loan = col_character(),
##   ..   day = col_double(),
##   ..   month = col_character(),
##   ..   duration = col_double(),
##   ..   campaign = col_double(),
##   ..   pdays = col_double(),
##   ..   previous = col_double(),
##   ..   poutcome = col_character(),
##   ..   response = col_character(),
##   ..   response_binary = col_double()
##   .. )

Preparing the data

The dataset consists of 18 variables. One of them is a column that simply enumerates the rows of the data frame, and two others are differently coded versions of the outcome I need to predict - whether a client subscribed to a term deposit or not. Since I don’t need both versions, I have to choose one; I keep the one with ‘yes’/‘no’ answers and drop the binary numeric one. The enumerating column is also dropped, as it serves no purpose for the analysis. Finally, I convert all categorical variables, including the outcome, from character class to factor class - the reason for this is explained a bit later.

bank$response <- as.factor(bank$response)
bank$poutcome <- as.factor(bank$poutcome)
bank$education <- as.factor(bank$education)
bank$job <- as.factor(bank$job)
bank$default <- as.factor(bank$default)
bank$loan <- as.factor(bank$loan)
bank$housing <- as.factor(bank$housing)
bank$month <- as.factor(bank$month)
bank$marital <- as.factor(bank$marital)
bank$day <- as.factor(bank$day)

bank_drop <- bank %>% select(-X1, -response_binary)
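For reference, had I started from the freshly loaded data, the same preparation could be written more compactly with dplyr; this is just an equivalent sketch of the steps above (it assumes a dplyr version that provides across()), not a change in the analysis:

# Drop the helper columns and convert every character column (plus day) to a factor
bank_drop <- bank %>%
  select(-X1, -response_binary) %>%
  mutate(across(where(is.character), as.factor),
         day = as.factor(day))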

As the client requires, I need to build two types of predictive models: a decision tree and a Bayesian network. Let me start with the tree and then proceed to the Bayesian network. The decision tree will be a classification tree, since the outcome variable is categorical. Also, due to technical peculiarities of the tree algorithm, it does not accept character-class variables as input - that is why I converted all character variables to factors a little earlier.

Now, let’s proceed to model building. I start by partitioning the dataset into two parts: a training set that I will use for building the model, and a test set that will be used for assessing the model’s accuracy.

library(caret)
set.seed(1011180) 
ind = createDataPartition(bank_drop$response, p = 0.20, list = F) 
bank.test = bank_drop[ind,] 
bank.train = bank_drop[-ind,]

Let’s take a look at the number of observations in each class of the outcome variable.

library(ggplot2)
ggplot(bank.train, aes(x = response)) +
  geom_bar(col = "black", fill = "coral") +
  theme_bw() +
  xlab("If a client subscribed a term deposit") +
  ylab("Number of observations") +
  labs(title = "Distribution of observations in the outcome variable", subtitle = "Training set") +
  geom_text(aes(label = ..count..), stat = "count",
            position = position_dodge(width = 0.9), vjust = -0.25)

ggplot(bank.test, aes(x = response)) +
  geom_bar(col = "black", fill = "coral") +
  theme_bw() +
  xlab("If a client subscribed a term deposit") +
  ylab("Number of observations") +
  labs(title = "Distribution of observations in the outcome variable", subtitle = "Test set") +
  geom_text(aes(label = ..count..), stat = "count",
            position = position_dodge(width = 0.9), vjust = -0.25)

It can be seen that in the training set there are about 29k observations with the ‘no’ answer and about 4k with the ‘yes’ answer. In the test set, there are approximately 7.5k observations in the ‘no’ category and almost 1k in the ‘yes’ category. Obviously, there is a serious class imbalance.
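The exact counts behind these figures can be checked directly with a quick tabulation (nothing new here, just the numbers underlying the plots):

# Class counts and proportions in the training and test sets
table(bank.train$response)
prop.table(table(bank.train$response))
table(bank.test$response)
prop.table(table(bank.test$response))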

Decision tree model

After the data has been partitioned, the tree can be constructed. As an additional measure I include 10-fold cross-validation, to make sure the resulting model doesn’t overfit. Also, as you can see in the formula, I exclude the day variable from the model. I do this on purpose, having found that with day included the algorithm refuses to run, since the variable contains 31 values (or levels, as it is stored as a factor).

bank_tree3 <- train(response ~ . - day, data = bank.train,
                    trControl = trainControl(method = "cv", number = 10),
                    method = "rpart")
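Before inspecting the final model, it is worth a quick look at how the cross-validation chose the complexity parameter; the resampled accuracy for each candidate cp is stored in the train object (standard caret fields):

# Cross-validated accuracy and Kappa for each candidate cp, and the chosen value
bank_tree3$results
bank_tree3$bestTune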

Once the model is built, we can look at the predictors used in the final model, i.e. those the decision tree decided to keep for the final prediction. Plotting the tree itself shows them. It appears these features are the last contact duration and a successful outcome of the previous marketing campaign. It seems a bit odd that there are so few of them, but so it goes.

library(rpart.plot)
rpart.plot(bank_tree3$finalModel)

However, more features may well contribute to the tree’s prediction. Let’s extract variable importance from the model. The resulting table, as well as a lollipop-like chart, suggests that 9 features have some influence on the outcome variable, as their overall importance reported for the rpart model differs from zero. Still, the most important features are the successful outcome of the previous campaign and the last contact duration.

varImp_tree3 <- varImp(bank_tree3, scale = F)
varImp_tree3
## rpart variable importance
## 
##   only 20 most important variables shown (out of 37)
## 
##                    Overall
## poutcomesuccess   1254.911
## duration          1062.040
## pdays              404.385
## previous           397.019
## poutcomeunknown    395.681
## age                137.633
## housingyes          14.267
## monthmay            10.736
## campaign             8.491
## jobstudent           0.000
## jobtechnician        0.000
## monthsep             0.000
## monthjun             0.000
## balance              0.000
## maritalsingle        0.000
## monthdec             0.000
## defaultyes           0.000
## jobretired           0.000
## monthjul             0.000
## educationtertiary    0.000
plot(varImp_tree3, top = 15)

But that was about the model’s structure. Now let’s turn to the model’s performance. To do that, I call for the confusion matrix and statistics of the model. As can be seen below, the model performs quite well, producing an accuracy of 0.9 on the part of the dataset it has never seen, i.e. the test set. This means it predicts 90% of observations accurately. Looking at the confusion matrix itself (rows are predictions, columns are the actual classes), 7035 observations of the ‘no’ class were predicted correctly, while 206 observations of that class were misclassified as ‘yes’. As for the ‘yes’ class, 321 observations were predicted correctly, while 607 were misclassified as ‘no’. In other words, the model deals much worse with observations of the ‘yes’ class, as roughly two-thirds of them were misclassified. This problem can largely be attributed to the fact that this class is poorly represented in the sample compared to the ‘no’ class - the model does very well on the ‘no’ class.

library(e1071)
bank_pred3 = predict(bank_tree3, bank.test, type="raw")
confusionMatrix(table(bank_pred3, bank.test$response))
## Confusion Matrix and Statistics
## 
##           
## bank_pred3   no  yes
##        no  7035  607
##        yes  206  321
##                                           
##                Accuracy : 0.9005          
##                  95% CI : (0.8938, 0.9069)
##     No Information Rate : 0.8864          
##     P-Value [Acc > NIR] : 2.419e-05       
##                                           
##                   Kappa : 0.3911          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9716          
##             Specificity : 0.3459          
##          Pos Pred Value : 0.9206          
##          Neg Pred Value : 0.6091          
##              Prevalence : 0.8864          
##          Detection Rate : 0.8612          
##    Detection Prevalence : 0.9355          
##       Balanced Accuracy : 0.6587          
##                                           
##        'Positive' Class : no              
## 
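To make the per-class reading explicit, the class-wise error rates can be computed straight from the same table; this is a small sketch, relying on the fact that columns of the table hold the actual classes and rows hold the predictions:

# Off-diagonal entry divided by the column total gives the per-class error rate
cm <- table(bank_pred3, bank.test$response)
no_error  <- cm["yes", "no"] / sum(cm[, "no"])    # true 'no' predicted as 'yes'
yes_error <- cm["no", "yes"] / sum(cm[, "yes"])   # true 'yes' predicted as 'no'
c(no_error = no_error, yes_error = yes_error)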

Intermediate conclusions

To conclude this part, the hypotheses suggested by the decision tree model are the following. A client subscribing to a term deposit is primarily associated with a successful outcome of the previous campaign and the duration of the last contact, and secondarily with the number of days that passed after the client was last contacted in a previous campaign, the number of contacts performed for this client before this campaign, whether the client has a housing loan, the month of the last contact, and the number of contacts performed for this client during this campaign.

Bayesian network model

First of all, let’s try to build this model including all variables we have in the dataset, omitting the binary numeric response, of course. Because of the difficulties Bayesian network models have with numeric variables, I transform all numeric variables into categorical ones using discretization.

library(bnlearn) 
library(Rgraphviz)

bank_break <- bank %>% select(-X1, -response_binary) %>% discretize(breaks=4, method="interval")
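As a quick sanity check on the discretization, the interval labels produced for the formerly numeric variables can be listed, for instance:

# Four equal-width intervals are created for each discretized variable
levels(bank_break$age)
levels(bank_break$duration)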

Now, the structure of the Bayesian network has to be built from scratch, because I have no prior idea how the features are linked together and related to the outcome variable. I rely on the hill-climbing algorithm (hc) to learn the structure of the network for me.

bank_learned = hc(bank_break)
graphviz.plot(bank_learned, shape = "ellipse")
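When the plot gets this crowded, it also helps to list the learned arcs in plain text; bnlearn exposes them through an accessor function:

# Every directed edge of the learned structure as a from/to matrix
arcs(bank_learned)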

From the resulting visualization it becomes clear that the structure is rather messed up - several variables are misplaced. The outcome variable, response, for sure, as it should not influence the duration variable: the value of the latter was recorded before the result of the term-deposit subscription. The impact of job on education suggested by this structure is also suspicious; it should probably go in the opposite direction. Likewise, the relation between the month and day variables should probably be reversed. Going further, it seems logical to me that whether a client has credit in default (the default variable) influences the client’s balance and whether (s)he has a personal loan, not vice versa. Finally, age should be an exogenous variable - it cannot be influenced by anything. Let’s fix those relationships by specifying origin and target variables for them.

bank_learnedRev = hc(bank_break,
                     whitelist = data.frame(from = c("default", "default", "loan", "balance", "duration", "education", "month"),
                                            to   = c("loan", "balance", "month", "month", "response", "job", "day")))

graphviz.plot(bank_learnedRev, shape = "ellipse")

The relationships that were problematic before are now fine. But I’ve got other issues with the variables’ relations. The month of the last contact influences education, which seems wrong. And age is an endogenous variable again, being affected by the client’s education and job.

After these unsuccessful attempts at building a reasonable network, I found it logical to use the hypotheses resulting from the decision tree. So I included only those features from the dataset whose overall variable importance detected by the decision tree differed from zero. They are coded as poutcome, duration, pdays, previous, age, housing, month, and campaign.

bank_bn <- bank %>% select(response, poutcome, duration, pdays, previous, age, housing, month, campaign)
bank_break1 <- discretize(bank_bn, breaks = 4, method = "interval")

bank_learned1 <- hc(bank_break1)
graphviz.plot(bank_learned1, shape = "ellipse")

Well, the same problem with the response-duration relation occurs again. Luckily, it is the only one that concerns me in this structure. Let’s fix it once more by whitelisting the arc from duration to the target variable, response.

bank_learned1Rev = hc(bank_break1, 
             whitelist = data.frame(from = "duration", to = "response"))

graphviz.plot(bank_learned1Rev, shape = "ellipse")

Okay, here we go. Now it looks much better. Let me try to describe it. According to the resulting structure:

  • age of the client is a confounding variable - it influences two variables at the same time: whether a client has a housing loan and the month of the last contact. The first relation seems reasonable - younger clients, say from 18 to 25-27, may be more likely to have a housing loan (due to money shortage, I guess) - while the latter seems a bit odd. But the fact that age is exogenous in this structure is pleasing.

  • housing is a confounder as well, yet it is endogenous, being influenced by age. Housing influences the outcome variable, which can be explained as follows: if a client has a housing loan, (s)he probably does not have spare money to put into a deposit - any extra profit goes toward loan payments. The relation of housing to the month of last contact (month) is doubtful. The influence of housing on the outcome of the previous campaign (poutcome) could be understood as the bank not having targeted holders of housing loans in its previous campaign, which is why the latter might have been unsuccessful.

  • month is a confounder and an endogenous variable. It is influenced by housing and age, while affecting the success of the previous campaign (poutcome), the number of contacts performed for this client during this campaign (campaign), and the number of days that passed after the client was last contacted in a previous campaign (pdays). The latter has a fairly obvious temporal relation, I think. The relations of month to campaign and pdays can be attributed to the peculiarities of the previous campaign.

  • poutcome is an endogenous variable influenced by month and housing. It is also a confounder, having an impact on the outcome variable, the pdays variable, and the number of contacts performed for this client before this campaign (previous). The relation of the previous campaign’s outcome to a subscription to a term deposit is very reasonable and positively directed, meaning that if poutcome is “success”, the outcome will probably be “yes”. Note that a similarly strong relation was observed in the decision tree model. The poutcome-pdays relation, however, does not seem reasonable.

  • campaign is endogenous - it depends on the month variable. It is also one of the terminal nodes in the network, having supposedly no relation to the outcome variable.

  • duration is an exogenous variable with a direct influence on the outcome, displaying a pattern similar to the results of the decision tree model. Funnily enough, even the page with the variable descriptions, from which the data were obtained, states that duration has a large effect on the outcome variable.

  • previous is endogenous, being influenced by poutcome. It is also a terminal node, meaning it has no impact on whether a client subscribes to a term deposit or not.

  • pdays is endogenous and terminal. According to the structure of this Bayesian network, it has no influence on the outcome variable.

In general, the model is logically valid. Compared to the tree model, the network also shows the duration variable directly influencing the outcome variable. However, whereas in the tree the duration variable can affect the response through the poutcome variable, there is no such pattern in the network.
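The verbal description above can be cross-checked against the structure itself: bnlearn can report the parents, children, and Markov blanket of any node, for example the outcome:

# Nodes pointing into the outcome, nodes it points to, and its Markov blanket
parents(bank_learned1Rev, "response")
children(bank_learned1Rev, "response")
mb(bank_learned1Rev, "response")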

Now, the Bayesian network model can be fitted given the abovementioned structure.

bank_fitted = bn.fit(bank_learned1Rev, data = bank_break1)

If the resulting bn.fit object is printed, the whole set of conditional probability tables (CPTs) for every feature is displayed. As there are 8 predictor variables, each with from 2 to 12 categories, I am not going to print it - it is far too large to be useful. Instead, let’s look at separate CPTs relating each feature to the outcome variable.

duration

It seems that if the bank wants its clients to subscribe to a term deposit, the duration of the contact is better to fall in the second or third interval, roughly between 20 and 60 (duration in this cleaned dataset appears to be recorded in minutes rather than seconds).

library(gRain)
BBNnet <- as.grain(bank_fitted)
querygrain(BBNnet, nodes=c("response", "duration"), type="conditional")
##                response
## duration               no       yes
##   [0.0181,20.6] 0.8922868 0.1077132
##   (20.6,41]     0.3739837 0.6260163
##   (41,61.5]     0.4242188 0.5757812
##   (61.5,82.1]   0.7332705 0.2667295

age

Next, if the bank wants its clients to subscribe to a term deposit, it should target older clients. Still, the probability of a client subscribing to a term deposit based on the age variable alone remains small.

querygrain(BBNnet, nodes=c("response", "age"), type="conditional")
##              response
## age                  no       yes
##   [17.9,37.2] 0.8905386 0.1094614
##   (37.2,56.5] 0.8885657 0.1114343
##   (56.5,75.8] 0.8582336 0.1417664
##   (75.8,95.1] 0.8005698 0.1994302

campaign

If the bank wants its clients to subscribe to a term deposit, the number of contacts performed for this client during this campaign should be below roughly 30. Still, the probability of a client subscribing to a term deposit based on the campaign variable alone is small.

querygrain(BBNnet, nodes=c("response", "campaign"), type="conditional")
##               response
## campaign              no        yes
##   [0.943,15.2] 0.8864982 0.11350181
##   (15.2,29.5]  0.8868090 0.11319104
##   (29.5,43.8]  0.8940054 0.10599463
##   (43.8,58.1]  0.9026412 0.09735884

poutcome

If the bank wants its clients to subscribe to a term deposit, the previous marketing campaign should definitely have been successful.

querygrain(BBNnet, nodes=c("response", "poutcome"), type="conditional")
##          response
## poutcome         no        yes
##   failure 0.8748473 0.12515268
##   success 0.3541194 0.64588062
##   unknown 0.9093514 0.09064863

previous

If the bank wants its clients to subscribe to a term deposit, the number of contacts performed for this client before this campaign should preferably fall in the interval from about 14 to 27. That said, the probabilities of each outcome given the different intervals of the previous variable do not differ much.

querygrain(BBNnet, nodes=c("response", "previous"), type="conditional")
##                response
## previous               no       yes
##   [-0.055,13.8] 0.8866498 0.1133502
##   (13.8,27.5]   0.8065551 0.1934449
##   (27.5,41.2]   0.8227745 0.1772255
##   (41.2,55.1]   0.8748473 0.1251527

pdays

If the bank wants its clients to subscribe to a term deposit, the number of days that passed after the client was last contacted in a previous campaign should be between roughly 435 and 653. That is actually somewhat contradictory to my previous inferences. It looks like clients who subscribe to a term deposit do not like to be bothered by frequent calls from the bank.

querygrain(BBNnet, nodes=c("response", "pdays"), type="conditional")
##              response
## pdays                no       yes
##   [-1.87,217] 0.8896056 0.1103944
##   (217,435]   0.8481003 0.1518997
##   (435,653]   0.7151452 0.2848548
##   (653,872]   0.8205680 0.1794320

housing

If the bank wants its clients to subscribe to a term deposit, it should target clients without housing loans. Still, a client’s answer would probably remain ‘no’, as the difference in conditional probabilities between the categories of the housing variable is not large.

querygrain(BBNnet, nodes=c("response", "housing"), type="conditional")
##        response
## housing        no        yes
##     no  0.8387720 0.16122796
##     yes 0.9242159 0.07578415

month

If the bank wants its clients to subscribe to a term deposit, it is better to contact its clients for the last time in December, March, October, or September.

querygrain(BBNnet, nodes=c("response", "month"), type="conditional")
##      response
## month        no        yes
##   apr 0.8846295 0.11537054
##   aug 0.8668248 0.13317518
##   dec 0.7140620 0.28593803
##   feb 0.8597739 0.14022614
##   jan 0.8595534 0.14044659
##   jul 0.8973635 0.10263649
##   jun 0.8879112 0.11208884
##   mar 0.8048301 0.19516987
##   may 0.9159796 0.08402041
##   nov 0.8794731 0.12052687
##   oct 0.7662570 0.23374304
##   sep 0.7276266 0.27237343

The direct significance of each feature for the outcome variable is difficult to estimate with this kind of model - perhaps because Bayesian networks were never meant for that purpose; they reflect conditional probabilities of events. What can be done instead is inference: computing the probability of a particular outcome given some state of a predictor variable, or a combination of such states.
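As an illustration of such an inference with combined evidence, the gRain object built earlier can be conditioned on several predictor states at once; a small sketch, with the state names taken from the CPTs printed above:

# Probability of the response given a successful previous campaign AND no housing loan
BBNnet_ev <- setEvidence(BBNnet, nodes = c("poutcome", "housing"),
                         states = c("success", "no"))
querygrain(BBNnet_ev, nodes = "response")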

Final conclusions

Personally, I think both models deserve to be included in an analytical report of this kind and cannot really be ranked against each other. They show their strength in somewhat different areas of business analysis. The decision tree turned out to be more suitable for prediction and can actually be used to predict, on new data with the same features, whether a client subscribes to a term deposit or not. The Bayesian network, on the other hand, becomes very helpful when a more detailed analysis begins and one needs to try several combinations of variable interactions - something that cannot be adjusted or observed in a decision tree. In particular, the Bayesian network model provides a foundation for further what-if analysis. And that is what comes right after my part of the story.

That’s all, thank you for bearing with me!