Hello everyone! On this page, I want to share an analysis of bank marketing data from Portugal. The goal is to predict whether a customer will buy the product after receiving a call from a bank officer. I am going to use two classification methods, the Naive Bayes and Decision Tree algorithms, and at the end I will compare which method produces the best results.
Let’s get started!
Loading the packages
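The package-loading chunk was not rendered; here is a minimal sketch, inferred from the functions used later in this post (caret is assumed for upSample() and confusionMatrix(), e1071 for naiveBayes(), rattle for fancyRpartPlot(), ROCR for the ROC curve):

library(dplyr)    # data wrangling: mutate(), select(), glimpse()
library(ggplot2)  # plotting
library(plotly)   # interactive charts via ggplotly()
library(glue)     # tooltip strings
library(caret)    # upSample(), confusionMatrix()
library(e1071)    # naiveBayes()
library(rpart)    # decision tree
library(rattle)   # fancyRpartPlot()
library(ROCR)     # prediction(), performance()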
Importing the data
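The import chunk is not shown either; a sketch assuming the UCI "bank.csv" file, which is semicolon-separated (the file name and path are assumptions):

# stringsAsFactors = TRUE reproduces the <fct> columns seen in the overview below
bank <- read.csv("bank.csv", sep = ";", stringsAsFactors = TRUE)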
Overview of the data
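The overview below matches the output of dplyr's glimpse(); a sketch of the call:

glimpse(bank)  # lists each column with its type and first few values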
## Observations: 4,521
## Variables: 17
## $ age <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 3...
## $ job <fct> unemployed, services, management, management, blue-collar...
## $ marital <fct> married, married, single, married, married, single, marri...
## $ education <fct> primary, secondary, tertiary, tertiary, secondary, tertia...
## $ default <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...
## $ balance <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374,...
## $ housing <fct> no, yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes,...
## $ loan <fct> no, yes, no, yes, no, no, no, no, no, yes, no, no, no, no...
## $ contact <fct> cellular, cellular, cellular, unknown, unknown, cellular,...
## $ day <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, ...
## $ month <fct> oct, may, apr, jun, may, feb, may, may, may, apr, may, ap...
## $ duration <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113,...
## $ campaign <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, ...
## $ pdays <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, ...
## $ previous <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, ...
## $ poutcome <fct> unknown, failure, failure, unknown, unknown, failure, oth...
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, yes, ...
Here is the attribute information for the data frame. Input variables:
age (numeric)
job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
education (categorical: ‘primary’,‘secondary’,‘tertiary’,‘unknown’)
default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
contact: contact communication type (categorical: ‘cellular’,‘telephone’,‘unknown’)
month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
day: last contact day of the month (numeric)
duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means the client was not previously contacted)
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign (categorical: ‘unknown’,‘other’,‘failure’,‘success’)
y: has the client subscribed to a term deposit? (binary: ‘yes’,‘no’)
Since all of the data classes are already in the correct form, I will now check for missing values.
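A sketch of the missing-value check that produces the per-column counts below (the exact call is an assumption, but colSums() over is.na() yields this output format):

colSums(is.na(bank))  # number of NA values in each column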
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
Great! There are no missing values in the data.
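A quick look at the first rows (the output below matches head()):

head(bank)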
## age job marital education default balance housing loan contact day
## 1 30 unemployed married primary no 1787 no no cellular 19
## 2 33 services married secondary no 4789 yes yes cellular 11
## 3 35 management single tertiary no 1350 yes no cellular 16
## 4 30 management married tertiary no 1476 yes yes unknown 3
## 5 59 blue-collar married secondary no 0 yes no unknown 5
## 6 35 management single tertiary no 747 no no cellular 23
## month duration campaign pdays previous poutcome y
## 1 oct 79 1 -1 0 unknown no
## 2 may 220 1 339 4 failure no
## 3 apr 185 1 330 1 failure no
## 4 jun 199 4 -1 0 unknown no
## 5 may 226 1 -1 0 unknown no
## 6 feb 141 2 176 3 failure no
For this machine learning process, I will use the Naive Bayes and Decision Tree algorithms to predict the y variable.
Since I am going to use the Naive Bayes method, I have to make sure that all the predictors are of factor type. I will convert the numeric variables to factors (some variables have to be grouped) and eliminate unused variables.
In order to group certain variables, I look at the summary of each variable. I choose the 1st quartile, the mean, and the 3rd quartile as the group boundaries.
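A sketch of the summary calls; the three outputs below correspond to balance, duration, and campaign, in that order:

summary(bank$balance)
summary(bank$duration)
summary(bank$campaign)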
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3313 69 444 1423 1480 71188
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4 104 185 264 329 3025
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.794 3.000 50.000
bank_bayes <- bank %>%
  mutate(age = as.factor(ifelse(age < 30, "<30",
                         ifelse(age >= 30 & age <= 50, "30-50", ">50")))) %>%
  mutate(balance = as.factor(ifelse(balance < 69, "Low",
                             ifelse(balance >= 69 & balance <= 1480, "Medium", "High")))) %>%
  mutate(duration = as.factor(ifelse(duration < 104, "Short",
                              ifelse(duration >= 104 & duration <= 264, "Medium", "Long")))) %>%
  mutate(campaign = as.factor(ifelse(campaign <= 1, "Rarely",
                              ifelse(campaign > 1 & campaign <= 2, "Medium", "Often")))) %>%
  select(-c(day, month, pdays))

Great! Now the data is prepared for Naive Bayes.
I want to show you the proportion of the y target variable using the following plot:
p1 <- bank_bayes %>%
  group_by(y) %>%
  summarise(freq = n()) %>%
  ggplot(mapping = aes(x = y, y = freq)) +
  geom_col(position = "stack",
           aes(fill = y, text = glue("Yes : {y}
Freq : {freq}"))) +
  theme_minimal()
ggplotly(p1, tooltip = "text")

From the chart above, it seems that the class proportion is imbalanced. To make a better prediction, it is necessary to balance the data. I am going to use upsampling for this process.
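The upsampling chunk is not shown; here is a sketch using caret's upSample(), which resamples the minority class with replacement until both classes are equal in size (the object name bayes_up comes from the next chunk; the seed value is an assumption):

set.seed(123)  # assumed seed, for reproducibility of the resampling
bayes_up <- upSample(x = bank_bayes %>% select(-y),
                     y = bank_bayes$y,
                     yname = "y")  # keep the target column named "y"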
Check the proportion once again:
p2 <- bayes_up %>%
  group_by(y) %>%
  summarise(freq = n()) %>%
  ggplot(mapping = aes(x = y, y = freq)) +
  geom_col(position = "stack",
           aes(fill = y, text = glue("Yes : {y}
Freq : {freq}"))) +
  theme_minimal()
ggplotly(p2, tooltip = "text")

Okay, the proportion has been balanced.
Now, I have to split the data into train and test sets for cross-validation.
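A sketch of the 80/20 split, mirroring the index-based split used for the decision tree later in this post (the seed and object names are assumptions):

set.seed(123)
index_bayes <- sample(nrow(bayes_up), nrow(bayes_up) * 0.8)
train_bayes <- bayes_up[index_bayes, ]
test_bayes  <- bayes_up[-index_bayes, ]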
Build the model using the naiveBayes() function (from the e1071 package).
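A sketch of the model call, assuming e1071's naiveBayes() (consistent with the type = "raw" prediction used later):

# learn P(y | predictors) from the training data under the naive independence assumption
model_bayes <- naiveBayes(y ~ ., data = train_bayes)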
Done, model_bayes has been built
I will evaluate the model using the predict() function, which generates predictions on the test_bayes data. After that, I can use the confusionMatrix() function to see the accuracy.
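A sketch of the prediction and evaluation step (pred_bayes is a hypothetical name; positive = "yes" matches the 'Positive' class reported below):

pred_bayes <- predict(model_bayes, newdata = test_bayes)  # class labels at the default 0.5 threshold
confusionMatrix(data = pred_bayes, reference = test_bayes$y, positive = "yes")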
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 643 223
## yes 158 576
##
## Accuracy : 0.7619
## 95% CI : (0.7402, 0.7826)
## No Information Rate : 0.5006
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5237
##
## Mcnemar's Test P-Value : 0.001042
##
## Sensitivity : 0.7209
## Specificity : 0.8027
## Pos Pred Value : 0.7847
## Neg Pred Value : 0.7425
## Prevalence : 0.4994
## Detection Rate : 0.3600
## Detection Prevalence : 0.4587
## Balanced Accuracy : 0.7618
##
## 'Positive' Class : yes
##
From the confusion matrix above, the accuracy of the model is quite good (~76%). However, recall (sensitivity) is what matters in this case, so I have to reduce the number of false negatives in order to increase the recall. To do that, I simply lower the classification threshold.
pred_imp <- predict(object = model_bayes, newdata = test_bayes, type = "raw")
predict_bayes_imp <- pred_imp %>%
  as.data.frame() %>%
  mutate(tuning_pred = as.factor(ifelse(yes >= 0.4, "yes", "no")))

Build another confusion matrix:
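A sketch of the second evaluation, comparing the tuned predictions against the same test labels:

confusionMatrix(data = predict_bayes_imp$tuning_pred,
                reference = test_bayes$y,
                positive = "yes")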
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 598 164
## yes 203 635
##
## Accuracy : 0.7706
## 95% CI : (0.7492, 0.791)
## No Information Rate : 0.5006
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5413
##
## Mcnemar's Test P-Value : 0.0473
##
## Sensitivity : 0.7947
## Specificity : 0.7466
## Pos Pred Value : 0.7578
## Neg Pred Value : 0.7848
## Prevalence : 0.4994
## Detection Rate : 0.3969
## Detection Prevalence : 0.5238
## Balanced Accuracy : 0.7707
##
## 'Positive' Class : yes
##
After lowering the threshold to 0.4, the number of false negatives falls from 223 to 164. This means that fewer customers who are going to buy the product are predicted not to do so.
Looking at the accuracy, there is also an improvement, from 0.76 to 0.77. The model is getting better.
Using a decision tree is quite easy: I only have to use the rpart() function to create the model and then plot it with the fancyRpartPlot() function.
Since I have already prepared the bank data, I can now build the tree model.
First, upsample the data.
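The upsampling chunk for the tree is not shown; a sketch assuming the same caret::upSample() approach as before, applied here to the original bank data (whether the tree used the raw or the binned variables is not shown; the object name bank_tree_up comes from the next chunk):

set.seed(123)  # assumed seed
bank_tree_up <- upSample(x = bank %>% select(-y),
                         y = bank$y,
                         yname = "y")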
index_tree <- sample(nrow(bank_tree_up), nrow(bank_tree_up) * 0.8)
train_tree <- bank_tree_up[index_tree, ]
test_tree  <- bank_tree_up[-index_tree, ]
set.seed(123)
dtree <- rpart(formula = y ~ ., data = train_tree, method = "class")
fancyRpartPlot(dtree, sub = NULL)

A decision tree has several nodes that explain the probability of the predictors we use.
Each node shows the predicted class (“no” or “yes”), the predicted probability of each class, and the percentage of observations that fall into that node.
Same as before, I will make a prediction and a confusion matrix to evaluate the model.
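A sketch of the prediction and evaluation (pred_tree is a hypothetical name):

pred_tree <- predict(dtree, newdata = test_tree, type = "class")
confusionMatrix(data = pred_tree, reference = test_tree$y, positive = "yes")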
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 617 155
## yes 169 659
##
## Accuracy : 0.7975
## 95% CI : (0.777, 0.8169)
## No Information Rate : 0.5088
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5948
##
## Mcnemar's Test P-Value : 0.4702
##
## Sensitivity : 0.8096
## Specificity : 0.7850
## Pos Pred Value : 0.7959
## Neg Pred Value : 0.7992
## Prevalence : 0.5088
## Detection Rate : 0.4119
## Detection Prevalence : 0.5175
## Balanced Accuracy : 0.7973
##
## 'Positive' Class : yes
##
From the matrix summary above, we can see that the decision tree model works well on the data. The false-negative count is low, hence recall (sensitivity) is higher than in the previous models. The tree also gives higher accuracy (~80%), which makes this model quite effective for prediction.
The ROC curve may give us a view of whether we should tune the model or not, by evaluating the true positive rate against the false positive rate.
# pred_roc is assumed to be the class-probability matrix from predict() on the test set
pred_roc <- predict(dtree, newdata = test_tree, type = "prob")
pred_prob <- pred_roc[, 2]  # probability of the positive class ("yes")
bank_roc <- prediction(pred_prob, test_tree$y)
bank_performance <- performance(bank_roc, "tpr", "fpr")
plot(bank_performance)

After analyzing the ROC curve, it seems that I don't have to prune the decision tree, since it is already a reasonably sized tree.
data.frame(Model = c("Naive Bayes", "Naive Bayes Tuned", "Decision Tree"),
           Accuracy = c(0.76, 0.77, 0.80),
           Sensitivity = c(0.72, 0.79, 0.81))

##               Model Accuracy Sensitivity
## 1       Naive Bayes     0.76        0.72
## 2 Naive Bayes Tuned     0.77        0.79
## 3     Decision Tree     0.80        0.81
Based on the table above, it can be concluded that the decision tree is the best model among the three. It has the highest accuracy and sensitivity, which play an important role in decision making, and it is also highly interpretable. However, Naive Bayes can also be called a good model due to its speed in processing data and its ease of use. To raise the sensitivity and accuracy, all we need to do is take control of the threshold.