library(readr)
library(mlbench)
library(ggplot2)
library(dplyr)
library(caret)
library(MASS)
library(boot)
library(e1071)
library(SDMTools)
library(caTools)
library(car)
library(CatEncoders)
library(caret)
library(kableExtra)
library(partykit)
library(vip)
bank_cleaned <- read_csv("/students/apbakhareva_1/BA/bank_cleaned.csv")bank = bank_cleaned
bank$age = as.numeric(as.character(bank$age))
bank$duration = as.numeric(as.character(bank$duration))
bank$balance = as.numeric(as.character(bank$balance))
bank$job = as.factor(bank$job)
bank$marital = as.factor(bank$marital)
bank$education = as.factor(bank$education)
bank$job = as.factor(bank$job)
bank$default = as.factor(bank$default)
bank$housing = as.factor(bank$housing)
bank$loan = as.factor(bank$loan)
bank$month = as.factor(bank$month)
bank$response_binary = as.factor(bank$response_binary)
bank$poutcome = as.factor(bank$poutcome)
bank2 = bank %>%
dplyr::select(age, job, marital, education, default, balance, housing, loan, duration, campaign, poutcome, response_binary, previous, pdays, month)cv5<-trainControl(method="cv", number = 5)# кросс-валидация, 5 групп, валидироваться 5 раз
set.seed(18)
tree_model <- caret::train(response_binary~., method = 'ctree', data = train, trControl=cv5)set.seed(18)
predictions.on.train1 <- predict(tree_model, train)
confusionMatrix(predictions.on.train1, train$response_binary, positive = "1", mode = "prec_recall")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 28085 2147
## 1 877 1565
##
## Accuracy : 0.9074
## 95% CI : (0.9043, 0.9106)
## No Information Rate : 0.8864
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4599
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Precision : 0.64087
## Recall : 0.42161
## F1 : 0.50861
## Prevalence : 0.11361
## Detection Rate : 0.04790
## Detection Prevalence : 0.07474
## Balanced Accuracy : 0.69566
##
## 'Positive' Class : 1
##
Next, let’s do a closer analysis of positives and negatives to gain more insight into our model’s performance.
It is time to find out, how our model will perform on the test sample.
set.seed(18)
predictions.on.test1 <- predict(tree_model, test)
confusionMatrix(predictions.on.test1, test$response_binary, positive = "1", mode = "prec_recall")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7030 541
## 1 210 386
##
## Accuracy : 0.908
## 95% CI : (0.9016, 0.9142)
## No Information Rate : 0.8865
## P-Value [Acc > NIR] : 1.475e-10
##
## Kappa : 0.4588
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Precision : 0.64765
## Recall : 0.41640
## F1 : 0.50689
## Prevalence : 0.11351
## Detection Rate : 0.04726
## Detection Prevalence : 0.07298
## Balanced Accuracy : 0.69370
##
## 'Positive' Class : 1
##
Again, we should have a closer look at positives and negatives to draw conclusions.
We know that classes are imbalanced (4639 subscribers vs. 36202 non-subscribers). When there are strongly imbalanced classes, the cost of error on a large class is more expensive than on the small one, so the model is retrained towards the larger class. And we actually saw this from analysing tree model, since our it works well with prediction of non-subscribers (the larger class) and work awfully with prediction of subscribers (the smaller class). Therefore, we are to go with resampling to fix this.
cv5_smote<-cv5
cv5_smote$sampling<-"smote"
set.seed(18)
tree_model_smote <- caret::train(response_binary~., method = 'ctree', data = train, trControl=cv5_smote)set.seed(18)
s.on.train <- predict(tree_model_smote, train)
confusionMatrix(s.on.train, train$response_binary, positive = "1", mode = "prec_recall")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25414 640
## 1 3548 3072
##
## Accuracy : 0.8718
## 95% CI : (0.8682, 0.8754)
## No Information Rate : 0.8864
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.5256
##
## Mcnemar's Test P-Value : <2e-16
##
## Precision : 0.46405
## Recall : 0.82759
## F1 : 0.59466
## Prevalence : 0.11361
## Detection Rate : 0.09402
## Detection Prevalence : 0.20261
## Balanced Accuracy : 0.85254
##
## 'Positive' Class : 1
##
Next, we analyse positives and negatives:
It is time to check model`s performance on the test sample.
set.seed(18)
s.on.test <- predict(tree_model_smote, test)
confusionMatrix(s.on.test, test$response_binary, positive = "1", mode = "prec_recall")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6279 255
## 1 961 672
##
## Accuracy : 0.8511
## 95% CI : (0.8432, 0.8588)
## No Information Rate : 0.8865
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4446
##
## Mcnemar's Test P-Value : <2e-16
##
## Precision : 0.41151
## Recall : 0.72492
## F1 : 0.52500
## Prevalence : 0.11351
## Detection Rate : 0.08228
## Detection Prevalence : 0.19995
## Balanced Accuracy : 0.79609
##
## 'Positive' Class : 1
##
The final analysis of positives and negatives!
explain1 = lime::explain(x = test[7:8,],
explainer = explainer,
n_features = 8,
n_labels = 1)
plot_features(explain1)We have chosen the tree model which better suits to client`s purposes. However, since not of our metrics seems to be reliable, we should come to final conclusion, whether or not we can trust our model. As a creator of this package points out, the best way to do this is to explain the rationale which lies behind individual predictions. Here we have to cases for subsription and refusal to subscribe.
For case 1, there is 0.83 probability that a client will refuse to subscribe. We can see that the facts that the person has housing or personal loans, then the probability that they will refuse increases. Also, the outcome of previous campaign is unknown, which means that it is likely than a customer will refuse to subscribe during this campaign, too. However, the fact that duration of a call was less than 5.32 second refers to the decrease of the probability to refuse subscription. In this case, a person is married and has housing loan (these two facts are probably in association)
For case 2, there is 0.51 probability that a client will approve to subscribe. The major fact which contributes to the increase of the chance to subscribe is that the duration of a call was less than 5.32 seconds. Moreover, an abscense of personal loan refers to increased probability to subscribe. At the same time, the factors which lower probability to subscibe are unknown outcome and age < 48.
To sum up, the way in which given factors influence the probability to refuse or accept subscription fit the general logics, so all explanations sounds reasonable. Therefore, we can trust our model and continue to work with it further.
We can evaluate how significant the given variable is in terms of our tree model.
duration. Also, such variables (factors) as month, poutcome (outcome of the previous marketing campaign) and housing (having personal loan) are of greater importance for the model than others.For building a bayesian model, all the variables should be categorical.
bank.1 = bank2
bank.1$dur <- ifelse(bank2$duration < 1, "Less than 1sec",
ifelse(bank2$duration >= 1 & bank2$duration < 3, "1-3 sec",
ifelse(bank2$duration >= 3 & bank2$duration <= 5, "3-5 sec",
ifelse(bank2$duration >= 5 & bank2$duration <= 10, "5-10 sec", "more than 10sec"))))
bank.1$dur = as.factor(bank.1$dur)
bank.1$bal <- ifelse(bank2$balance < 0, "Negative",
ifelse(bank2$balance >= 0 & bank2$balance < 100, "less than 100$",
ifelse(bank2$balance >= 100 & bank2$balance <= 500, "100-500$",
ifelse(bank2$balance >= 500 & bank2$balance <= 1000, "500-1000$", "more than 1000$"))))
bank.1$pday <- ifelse(bank2$pdays <7, "less than week", "more than a week")
bank.1$prev <- ifelse(bank2$previous == 0, "no contacts", "contacts")
bank.1$pday = as.factor(bank.1$pday)
bank.1$prev = as.factor(bank.1$prev)
bank.1$dur = as.factor(bank.1$dur)
bank.1$bal = as.factor(bank.1$bal)
bank.1$campaign = as.factor(bank.1$campaign)
bank.1 = bank.1 %>% dplyr::select( -duration, -age, -balance)
bank.1 = bank.1 %>% dplyr::select( -pdays, -previous, -default)The first idea is to build the model automatically.
library(bnlearn)
bn_df <- data.frame(bank.1)
bnStructure = hc(bn_df)
#bnStructure
library(Rgraphviz)
graphviz.plot(bnStructure, shape = "ellipse")For now, we have graphical representation of the variables for convinient arcs` setting.
whitelist = data.frame(from = c("month", "month", 'prev', 'pday', 'poutcome', 'dur', 'campaign', 'education', 'job', 'bal', 'loan', 'marital', 'housing'),
to = c('prev', 'pday', 'poutcome', 'poutcome', 'response_binary', 'response_binary', 'response_binary', 'job', 'bal', 'loan', 'response_binary', 'housing', 'response_binary' ))##
## Bayesian network learned via Score-based methods
##
## model:
## [education][job|education][marital|job:education][month|job][bal|job]
## [loan|month:bal][housing|marital:loan:month][pday|housing:month]
## [prev|month:pday][campaign|prev][poutcome|housing:pday:prev]
## [dur|poutcome][response_binary|housing:loan:campaign:poutcome:dur]
## nodes: 13
## arcs: 24
## undirected arcs: 0
## directed arcs: 24
## average markov blanket size: 5.38
## average neighbourhood size: 3.69
## average branching factor: 1.85
##
## learning algorithm: Hill-Climbing
## score: BIC (disc.)
## penalization coefficient: 5.308721
## tests used in the learning procedure: 253
## optimized: TRUE
There are still misleading arcs, so we will fix it manually.
bnStructure = drop.arc(bnStructure, from = "job", to = "month")
bnStructure = drop.arc(bnStructure, from = "month", to = "housing")
bnStructure = drop.arc(bnStructure, from = "job", to = "marital")
bnStructure = drop.arc(bnStructure, from = "education", to = "marital")
#bnStructure = drop.arc(bnStructure, from = "bal", to = "default")
#bnStructure = drop.arc(bnStructure, from = "loan", to = "default")
bnStructure = drop.arc(bnStructure, from = "job", to = "marital")
bnStructure = drop.arc(bnStructure, from = "poutcome", to = "dur")
bnStructure = drop.arc(bnStructure, from = "month", to = "loan")
bnStructure = drop.arc(bnStructure, from = "poutcome", to = "dur")
bnStructure = drop.arc(bnStructure, from = "loan", to = "housing")
bnStructure = drop.arc(bnStructure, from = "housing", to = "poutcome")
bnStructure = drop.arc(bnStructure, from = "housing", to = "pday")
bnStructure = set.arc(bnStructure, from = "bal", to = "response_binary")
bnStructure = drop.arc(bnStructure, from = "pday", to = "prev")
bnStructure = set.arc(bnStructure, from = "month", to = "campaign")
#bnStructure = set.arc(bnStructure, from = "bal", to = "default")
#bnStructure = set.arc(bnStructure, from = "bal", to = "default")
bnStructure = drop.arc(bnStructure, from = "bal", to = "loan")
bnStructure = set.arc(bnStructure, from = "loan", to = "bal")
bnStructure = drop.arc(bnStructure, from = "loan", to = "response_binary")Obviously, our data contains 2 types of variables: about clients and about the service. There is also the outcome variable response_binary, which defines whether a person has subscribed to the service.
Возможно, стоит начать с response_binary и двигаться вверх в описании!
Education stands for an educational level a person has. There is an association between educational level and job position. The higher the educational level, the higher is the job position. RESEARCHJob stands for a type of job. There is an association between a job and balance since different types of jobs are differently paid.Loan stands for having a personal loan. There is an association between the loan and balance. If a person has a loan, their balance is decreased by the amount of loan.Bal stands for level of balance. There is an association between the balance and subscription. ** CHECK** The higher is the balance, the higher is the probability that a person will subscribe.Marital stands for marital status. There is an association between the marital status and having a housing loan. If a person is single, they will have lower housing loan, since they either live in smaller house or live with parents. RESEARCHHousing stands for having a housing loan. There is an association between having the housing loan and subscription. The less is loan, the higher is the probability that a person will subscribe. может нужна стрелка к balance все-таки?s variables *Monthstands for last contact month of the year. here is an association between the month and number of calls to a client. During the exploratory analysis, it was found out that march, september, october and december are ones with high subscription rates. Therefore, during these months all the indicators connected with banks performance during the campaigns were higher.Campaign stands for number of contacts performed for a particular client during the last campaign. There is an association between the number of calls to a client and subscription. The more we communicate with a client, the higher is the probability that they will subscribe.Prev stands for number of contacts performed before this campaign and for a particular client. There is an association between the number of calls to a client before the campaign and the outcome of the previous campaign. The more calls were done, the higher is the probability that the previous campaign was successful.
fitted = bn.fit(bnStructure, bn_df [1:32674, ])
testing = bn_df[32674:nrow(bn_df), ]
predicted = predict(fitted, node = "response_binary", data = testing)
table(predicted, testing[, "response_binary"])##
## predicted 0 1
## 0 5417 1818
## 1 143 408
Accuracy: 0.73 Specificity: 0.12 Recall: 0.98
Сеть лучше с точки зрения того, что она лучше предсказывает людей, которые не будут подписываться на наш сервис. Следовательно, мы на контакт с ними время и деньги.