library(readr)
library(mlbench)
library(ggplot2)
library(dplyr)
library(caret)
library(MASS)
library(boot)
library(e1071)
library(SDMTools)
library(caTools)
library(car)
library(CatEncoders)
library(caret)
library(kableExtra)
library(partykit)
library(vip)

bank_cleaned <- read_csv("/students/apbakhareva_1/BA/bank_cleaned.csv")

Analysing Subscriptions with Tree Models

Preparing the Data

bank = bank_cleaned
bank$age = as.numeric(as.character(bank$age))
bank$duration = as.numeric(as.character(bank$duration))
bank$balance = as.numeric(as.character(bank$balance))
bank$job = as.factor(bank$job)
bank$marital = as.factor(bank$marital)
bank$education = as.factor(bank$education)
bank$job = as.factor(bank$job)
bank$default = as.factor(bank$default)
bank$housing = as.factor(bank$housing)
bank$loan = as.factor(bank$loan)
bank$month = as.factor(bank$month)
bank$response_binary = as.factor(bank$response_binary)
bank$poutcome = as.factor(bank$poutcome)
bank2 = bank %>% 
  dplyr::select(age, job, marital, education, default, balance, housing, loan, duration, campaign, poutcome, response_binary, previous, pdays, month)

Creating Samples

set.seed(18)

trainIndex <- createDataPartition(bank2$response_binary, p = .8, 
                                  list = FALSE, 
                                  times = 1)
train <- bank2[ trainIndex,]
test  <- bank2[-trainIndex,]

Tree Model #1

cv5<-trainControl(method="cv", number = 5)# кросс-валидация, 5 групп, валидироваться 5 раз
set.seed(18)
tree_model <- caret::train(response_binary~., method = 'ctree', data = train, trControl=cv5)

Tree Model on Train Sample

set.seed(18)
predictions.on.train1 <- predict(tree_model, train)
confusionMatrix(predictions.on.train1, train$response_binary, positive = "1", mode = "prec_recall")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 28085  2147
##          1   877  1565
##                                           
##                Accuracy : 0.9074          
##                  95% CI : (0.9043, 0.9106)
##     No Information Rate : 0.8864          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4599          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.64087         
##                  Recall : 0.42161         
##                      F1 : 0.50861         
##              Prevalence : 0.11361         
##          Detection Rate : 0.04790         
##    Detection Prevalence : 0.07474         
##       Balanced Accuracy : 0.69566         
##                                           
##        'Positive' Class : 1               
##

Accuracy is really high. Almost 91% of our clients were correctly assigned to the classes: subscribers and non-subscribers.

Next, let’s do a closer analysis of positives and negatives to gain more insight into our model’s performance.

Out of 28962 non-subscribers, almost 97% of clients were correctly identified, which is really high.
Out of 3712 subscibers, only 42% were correctly defined, which is the field which should be improved.

Tree Model on Test Sample

It is time to find out, how our model will perform on the test sample.

set.seed(18)
predictions.on.test1 <- predict(tree_model, test)
confusionMatrix(predictions.on.test1, test$response_binary, positive = "1", mode = "prec_recall")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7030  541
##          1  210  386
##                                           
##                Accuracy : 0.908           
##                  95% CI : (0.9016, 0.9142)
##     No Information Rate : 0.8865          
##     P-Value [Acc > NIR] : 1.475e-10       
##                                           
##                   Kappa : 0.4588          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.64765         
##                  Recall : 0.41640         
##                      F1 : 0.50689         
##              Prevalence : 0.11351         
##          Detection Rate : 0.04726         
##    Detection Prevalence : 0.07298         
##       Balanced Accuracy : 0.69370         
##                                           
##        'Positive' Class : 1               
##

Accuracy has not dramatically changed and almost 91% of our clients are still correctly classified.

Again, we should have a closer look at positives and negatives to draw conclusions.

Out of 7240 non-subscribers, we still have almost 97% correctly classified clients.
Out of 927 subscibers, only 41% were correctly defined, which is the field which should be improved.

Resampling

We know that classes are imbalanced (4639 subscribers vs. 36202 non-subscribers). When there are strongly imbalanced classes, the cost of error on a large class is more expensive than on the small one, so the model is retrained towards the larger class. And we actually saw this from analysing tree model, since our it works well with prediction of non-subscribers (the larger class) and work awfully with prediction of subscribers (the smaller class). Therefore, we are to go with resampling to fix this.

cv5$seeds<-tree_model$control$seeds

cv5_smote<-cv5
cv5_smote$sampling<-"smote"
set.seed(18)
tree_model_smote <- caret::train(response_binary~., method = 'ctree', data = train, trControl=cv5_smote)

SMOTE on Train

set.seed(18)
s.on.train <- predict(tree_model_smote, train)
confusionMatrix(s.on.train, train$response_binary, positive = "1", mode = "prec_recall")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 25414   640
##          1  3548  3072
##                                           
##                Accuracy : 0.8718          
##                  95% CI : (0.8682, 0.8754)
##     No Information Rate : 0.8864          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.5256          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##               Precision : 0.46405         
##                  Recall : 0.82759         
##                      F1 : 0.59466         
##              Prevalence : 0.11361         
##          Detection Rate : 0.09402         
##    Detection Prevalence : 0.20261         
##       Balanced Accuracy : 0.85254         
##                                           
##        'Positive' Class : 1               
##

Accuracy has been lowered, but still remains high enough. For now, 87% of our clients were correctly assigned to the classes: subscribers and non-subscribers.

Next, we analyse positives and negatives:

Out of 28962 non-subscribers, almost 88% of clients were correctly identified, which is lower than in the first model, but still remains high.
Out of 3712 subscibers, almost 83% were correctly defined, which is pretty good and definitely much higher comparing with the previous model.

SMOTE on Test

It is time to check model`s performance on the test sample.

set.seed(18)
s.on.test <- predict(tree_model_smote, test)
confusionMatrix(s.on.test, test$response_binary, positive = "1", mode = "prec_recall")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6279  255
##          1  961  672
##                                           
##                Accuracy : 0.8511          
##                  95% CI : (0.8432, 0.8588)
##     No Information Rate : 0.8865          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4446          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##               Precision : 0.41151         
##                  Recall : 0.72492         
##                      F1 : 0.52500         
##              Prevalence : 0.11351         
##          Detection Rate : 0.08228         
##    Detection Prevalence : 0.19995         
##       Balanced Accuracy : 0.79609         
##                                           
##        'Positive' Class : 1               
##

Accuracy has slightly descreased, but still remains high. 85% of our clients were correctly classified.

The final analysis of positives and negatives!

Out of 7240 non-subscribers, we still have almost 87% correctly classified clients, which is pretty same as for train sample.
Out of 927 subscibers, 72% were correctly defined, which is good of our model.

Choosing the Best Tree Model

From the point of models` metrics, the result is obvious. The tree model with resampling make much better predictions. Although its accuracy is a bit lower, it works better in the case for correct class assignment for subscribers.
However, the initial model performs better in the case of predicting non-subscribers. From the point of view of monetary gains and losses, it will be more beneficial to use this model. In fact, this model will help to save company`s money, since there will be no need to waste resources on working with clients who will not subscribe anyway.
By contrast, if the company does not spend resources on attracting clients, but they still subscribe to the service, there is no need to contact them. Therefore, there is one more opportunity to save money.
In the result, we suggest to use the initial tree model as it will help to save money for implementing the future policies.

Local Interpretable Model-Agnostic Explanations

library(lime)
explainer = lime(x = train, model = tree_model)

explain1 = lime::explain(x = test[7:8,], 
                           explainer = explainer,
                           n_features = 8,
                           n_labels = 1)
plot_features(explain1)

We have chosen the tree model which better suits to client`s purposes. However, since not of our metrics seems to be reliable, we should come to final conclusion, whether or not we can trust our model. As a creator of this package points out, the best way to do this is to explain the rationale which lies behind individual predictions. Here we have to cases for subsription and refusal to subscribe.

For case 1, there is 0.83 probability that a client will refuse to subscribe. We can see that the facts that the person has housing or personal loans, then the probability that they will refuse increases. Also, the outcome of previous campaign is unknown, which means that it is likely than a customer will refuse to subscribe during this campaign, too. However, the fact that duration of a call was less than 5.32 second refers to the decrease of the probability to refuse subscription. In this case, a person is married and has housing loan (these two facts are probably in association)
For case 2, there is 0.51 probability that a client will approve to subscribe. The major fact which contributes to the increase of the chance to subscribe is that the duration of a call was less than 5.32 seconds. Moreover, an abscense of personal loan refers to increased probability to subscribe. At the same time, the factors which lower probability to subscibe are unknown outcome and age < 48.
To sum up, the way in which given factors influence the probability to refuse or accept subscription fit the general logics, so all explanations sounds reasonable. Therefore, we can trust our model and continue to work with it further.

Tree Model #2

treemodel <- partykit::ctree(response_binary~., data = train)
predictions.on.train2 <- predict(treemodel, train)
#confusionMatrix(predictions.on.train2, train$response_binary, positive = "1", mode = "prec_recall")

predictions.on.test <- predict(treemodel, test)
#confusionMatrix(predictions.on.test, test$response_binary, positive = "1", mode = "prec_recall")

Estimating variables` importance

vi(treemodel)

#vip(treemodel)
#ля

We can evaluate how significant the given variable is in terms of our tree model.

The most important variable is duration. Also, such variables (factors) as month, poutcome (outcome of the previous marketing campaign) and housing (having personal loan) are of greater importance for the model than others.
It turns out that a person`s age and credit default have no significance for our model.
These assumptions will be used while building a Bayesian model.

Bayesian model

Preparing data

For building a bayesian model, all the variables should be categorical.

bank.1 = bank2

bank.1$dur <- ifelse(bank2$duration < 1, "Less than 1sec",
                    ifelse(bank2$duration >= 1 & bank2$duration < 3, "1-3 sec", 
                    ifelse(bank2$duration >= 3 & bank2$duration <= 5, "3-5 sec",
                    ifelse(bank2$duration >= 5 & bank2$duration <= 10, "5-10 sec", "more than 10sec"))))
bank.1$dur = as.factor(bank.1$dur)

bank.1$bal <- ifelse(bank2$balance < 0, "Negative",
                    ifelse(bank2$balance >= 0 & bank2$balance < 100, "less than 100$", 
                    ifelse(bank2$balance >= 100 & bank2$balance <= 500, "100-500$",
                    ifelse(bank2$balance >= 500 & bank2$balance <= 1000, "500-1000$", "more than 1000$"))))

bank.1$pday <- ifelse(bank2$pdays <7, "less than week", "more than a week")
bank.1$prev <- ifelse(bank2$previous == 0, "no contacts", "contacts")
bank.1$pday = as.factor(bank.1$pday)
bank.1$prev = as.factor(bank.1$prev)
bank.1$dur = as.factor(bank.1$dur)
bank.1$bal = as.factor(bank.1$bal)
bank.1$campaign = as.factor(bank.1$campaign)
bank.1 = bank.1 %>% dplyr::select( -duration, -age, -balance)
bank.1 = bank.1 %>% dplyr::select( -pdays, -previous, -default)

Building Model from the Data

The first idea is to build the model automatically.

library(bnlearn)
bn_df <- data.frame(bank.1)
bnStructure = hc(bn_df)
#bnStructure
library(Rgraphviz)
graphviz.plot(bnStructure, shape = "ellipse")

For now, we have graphical representation of the variables for convinient arcs` setting.

Establishing arcs

whitelist = data.frame(from = c("month", "month", 'prev', 'pday', 'poutcome', 'dur', 'campaign', 'education', 'job', 'bal', 'loan', 'marital', 'housing'), 
                       to = c('prev', 'pday', 'poutcome', 'poutcome', 'response_binary', 'response_binary', 'response_binary', 'job', 'bal', 'loan', 'response_binary', 'housing', 'response_binary' ))

bnStructure = hc(bn_df, whitelist = whitelist)
bnStructure

## 
##   Bayesian network learned via Score-based methods
## 
##   model:
##    [education][job|education][marital|job:education][month|job][bal|job]
##    [loan|month:bal][housing|marital:loan:month][pday|housing:month]
##    [prev|month:pday][campaign|prev][poutcome|housing:pday:prev]
##    [dur|poutcome][response_binary|housing:loan:campaign:poutcome:dur]
##   nodes:                                 13 
##   arcs:                                  24 
##     undirected arcs:                     0 
##     directed arcs:                       24 
##   average markov blanket size:           5.38 
##   average neighbourhood size:            3.69 
##   average branching factor:              1.85 
## 
##   learning algorithm:                    Hill-Climbing 
##   score:                                 BIC (disc.) 
##   penalization coefficient:              5.308721 
##   tests used in the learning procedure:  253 
##   optimized:                             TRUE

library(Rgraphviz)
graphviz.plot(bnStructure, shape = "ellipse")

There are still misleading arcs, so we will fix it manually.

bnStructure  = drop.arc(bnStructure, from = "job", to = "month")
bnStructure  = drop.arc(bnStructure, from = "month", to = "housing")
bnStructure  = drop.arc(bnStructure, from = "job", to = "marital")
bnStructure  = drop.arc(bnStructure, from = "education", to = "marital")
#bnStructure  = drop.arc(bnStructure, from = "bal", to = "default")
#bnStructure  = drop.arc(bnStructure, from = "loan", to = "default")
bnStructure  = drop.arc(bnStructure, from = "job", to = "marital")
bnStructure  = drop.arc(bnStructure, from = "poutcome", to = "dur")
bnStructure  = drop.arc(bnStructure, from = "month", to = "loan")
bnStructure  = drop.arc(bnStructure, from = "poutcome", to = "dur")
bnStructure  = drop.arc(bnStructure, from = "loan", to = "housing")
bnStructure  = drop.arc(bnStructure, from = "housing", to = "poutcome")
bnStructure  = drop.arc(bnStructure, from = "housing", to = "pday")
bnStructure = set.arc(bnStructure, from = "bal", to = "response_binary")
bnStructure  = drop.arc(bnStructure, from = "pday", to = "prev")
bnStructure  = set.arc(bnStructure, from = "month", to = "campaign")
#bnStructure  = set.arc(bnStructure, from = "bal", to = "default")
#bnStructure  = set.arc(bnStructure, from = "bal", to = "default")
bnStructure  = drop.arc(bnStructure, from = "bal", to = "loan")
bnStructure  = set.arc(bnStructure, from = "loan", to = "bal")
bnStructure  = drop.arc(bnStructure, from = "loan", to = "response_binary")

Graph for Bayesian Model

graphviz.plot(bnStructure, shape = "ellipse")

Obviously, our data contains 2 types of variables: about clients and about the service. There is also the outcome variable response_binary, which defines whether a person has subscribed to the service.

Clients` variables

Возможно, стоит начать с response_binary и двигаться вверх в описании!

Education stands for an educational level a person has. There is an association between educational level and job position. The higher the educational level, the higher is the job position. RESEARCH
Job stands for a type of job. There is an association between a job and balance since different types of jobs are differently paid.
Loan stands for having a personal loan. There is an association between the loan and balance. If a person has a loan, their balance is decreased by the amount of loan.
Bal stands for level of balance. There is an association between the balance and subscription. ** CHECK** The higher is the balance, the higher is the probability that a person will subscribe.
Marital stands for marital status. There is an association between the marital status and having a housing loan. If a person is single, they will have lower housing loan, since they either live in smaller house or live with parents. RESEARCH
Housing stands for having a housing loan. There is an association between having the housing loan and subscription. The less is loan, the higher is the probability that a person will subscribe. может нужна стрелка к balance все-таки?

Service`s variables *`Month`stands for last contact month of the year. here is an association between the month and number of calls to a client. During the exploratory analysis, it was found out that march, september, october and december are ones with high subscription rates. Therefore, during these months all the indicators connected with bank`s performance during the campaigns were higher.

Campaign stands for number of contacts performed for a particular client during the last campaign. There is an association between the number of calls to a client and subscription. The more we communicate with a client, the higher is the probability that they will subscribe.
Prev stands for number of contacts performed before this campaign and for a particular client. There is an association between the number of calls to a client before the campaign and the outcome of the previous campaign. The more calls were done, the higher is the probability that the previous campaign was successful.
- There is also an association between the number of calls to a client before the campaign and number of contacts during this campaign. If a client heard about the service before, they will be able to come up with a decision quicker, so less calls will be needed.
`pdays’ stands for number of days that passed by after the client was last contacted from a previous campaign. НАВЕРНОЕ, НЕ ДОЛЖНО БЫТЬ СВЯЗИ С POUTCOMEThere is an association between the number of days passed and the subscription. хз, какая

Baesian Model`s Metrics

fitted = bn.fit(bnStructure, bn_df [1:32674, ])
testing = bn_df[32674:nrow(bn_df), ]
predicted = predict(fitted, node = "response_binary", data = testing)
table(predicted, testing[, "response_binary"])

##          
## predicted    0    1
##         0 5417 1818
##         1  143  408

Accuracy: 0.73 Specificity: 0.12 Recall: 0.98

Defining the Best Model

Сеть лучше с точки зрения того, что она лучше предсказывает людей, которые не будут подписываться на наш сервис. Следовательно, мы на контакт с ними время и деньги.

BA HW2

Anastasia Bakhareva, BSC-171

17.12.2019