A bank is a financial institution licensed to receive deposits and make loans. It may also provide financial services such as currency exchange and term deposits. As a financial institution, it is important for a bank to have insight into which factors affect its customers' decisions to use its financial services. Furthermore, banks need a way to classify customers, i.e. to know which customers will be interested in a given product. So, using the bank marketing dataset, I will build models with the Naive Bayes, Decision Tree, and Random Forest classifiers. The results will be compared to see which classifier is the most suitable for predicting whether a customer will subscribe to a term deposit or not.
The following packages are required in this notebook. Use install.packages() to install any packages that are not already installed, then load them with the library() function. A brief explanation of the required packages:
tidyverse: data manipulation
e1071: naive bayes classifier
caret: upsampling/downsampling and confusion matrix
ROCR: evaluate ROC curve and AUC
partykit: decision tree classifier
rsample: cross validation
randomForest: random forest classifier
library(tidyverse)
library(e1071)
library(caret)
library(ROCR)
library(partykit)
library(rsample)
library(randomForest)
bank <- read.csv("data_input/bank-full.csv",
sep = ";",
stringsAsFactors = T)
head(bank)
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, ~
## $ job <fct> management, technician, entrepreneur, blue-collar, unknown, ~
## $ marital <fct> married, single, married, married, single, married, single, ~
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, ~
## $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,~
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71~
## $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y~
## $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no~
## $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow~
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ~
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, may, ~
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,~
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ~
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow~
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, ~
We can see that there are 45,211 instances and 17 features in this dataset. Most of the features are categorical. Using glimpse(), we can see that all the data types are correct. However, there are some "unknown" values in the data, which I will treat as missing values. Let's first check whether the data has any missing or duplicated values.
Replacing “unknowns” with NA, then checking whether the data has any missing values.
bank <- bank %>%
mutate(across(.cols = everything(),
.fns = ~replace(., . == "unknown", NA)))
colSums(is.na(bank))
## age job marital education default balance housing loan
## 0 288 0 1857 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 13020 0 0 0 0 0 0 36959
## y
## 0
contact and poutcome have a lot of missing values; in fact, most of the values in poutcome are unknown, so I will drop these variables. I am also concerned about the day and month features: there is no information about whether the data was collected within a single year. If it was not, these features are meaningless, because the exact dates are not provided. So I will drop day and month as well. Additionally, job and education also have some missing values, but they amount to less than 5% of the data, so I will simply drop the observations with missing values in those two features.
bank <- bank %>%
select(-c(contact, poutcome, day, month)) %>%
drop_na()
We have cleaned the missing values; now let's check whether the data has any duplicated rows. In a dataset like this, I do not think duplicates should exist: since the data contains features such as the account balance and the number of days since the last campaign, every observation should be unique.
sum(duplicated(bank))
## [1] 1
The data has only one duplicated row. I will simply remove it.
bank <- bank[!duplicated(bank), ]
We will also check whether any feature has near-zero variance. Such features should be removed, since they are uninformative and will contribute little during model construction. This gives the final cleaned data.
bank <- bank[, -nearZeroVar(bank)]
bank
We want our target variable to have a balanced class proportion so that the model can classify all classes well instead of only the majority class. Let's first check the class proportions.
prop.table(table(bank$y))
##
## no yes
## 0.8837516 0.1162484
table(bank$y)
##
## no yes
## 38171 5021
We have a moderate class imbalance. This is a problem, since training on the data as-is might produce a model that predicts well on only one class. Let's first split the data into an 80% train set and a 20% test set: the train set for constructing the model, and the test set for measuring out-of-sample performance.
set.seed(100)
bank_split <- initial_split(bank, prop = 0.8, strata = "y")
train <- training(bank_split)
test <- testing(bank_split)
Rechecking the class proportion in the train set.
prop.table(table(train$y))
##
## no yes
## 0.8837694 0.1162306
table(train$y)
##
## no yes
## 30536 4016
Before proceeding to the modeling part, we need to handle the class imbalance. There are a few techniques we can use: upsampling, downsampling, SMOTE, etc. Since there are quite a lot of observations in the data, I will use downsampling. This method removes some observations from the majority class, leaving 8,032 observations (4,016 per class) for training the model.
train_down <- downSample(x = train[, -13], # predictors only (column 13 is the target y)
y = train$y,
yname = "y")
table(train_down$y)
##
## no yes
## 4016 4016
We are done with pre-processing the data. Now we can use it to construct and evaluate classifiers.
Naive Bayes is a simple technique for constructing classifiers by applying Bayes' theorem to the data features. It has one strong assumption: all features are equally important and conditionally independent. In real-world data this assumption is very likely violated, yet despite the violation, the Naive Bayes classifier can still give outstanding results. I will construct a Naive Bayes classifier with Laplace smoothing, in case any predictor level never occurs within a class.
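To see what Laplace smoothing does, here is a toy sketch with hypothetical counts (my own illustration, not from the bank data). Without smoothing, a feature level that never occurs within a class gets an estimated probability of zero, which zeroes out the entire product in Bayes' theorem; adding 1 to every count avoids this.
# Toy illustration of Laplace smoothing (hypothetical counts)
counts <- c(a = 10, b = 5, c = 0) # level "c" was never seen in this class
prop.table(counts)      # without smoothing, "c" gets probability 0
prop.table(counts + 1)  # with laplace = 1, "c" gets a small nonzero probability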
model_nb <- naiveBayes(y ~ ., data = train_down, laplace = 1)
conf_nb <- confusionMatrix(data = predict(model_nb, newdata = test),
reference = test$y,
positive = "yes")
conf_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6354 328
## yes 1281 677
##
## Accuracy : 0.8138
## 95% CI : (0.8054, 0.8219)
## No Information Rate : 0.8837
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3583
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.67363
## Specificity : 0.83222
## Pos Pred Value : 0.34576
## Neg Pred Value : 0.95091
## Prevalence : 0.11632
## Detection Rate : 0.07836
## Detection Prevalence : 0.22662
## Balanced Accuracy : 0.75293
##
## 'Positive' Class : yes
##
We can see that the Naive Bayes classifier has 81.38% accuracy. For this data, the false negative and false positive cases would be:
False negative: the model predicts that a customer will not subscribe to a term deposit when they actually would, so the bank misses a potential subscriber.
False positive: the model predicts that a customer will subscribe when they actually would not, so the bank wastes marketing effort on an uninterested customer.
Deciding which case is more severe for the bank is beyond my expertise. Feel free to decide which case matters more, and thus which metric in the confusion matrix is the most important. For now, I will use the ROC curve and AUC to see how good the model is at distinguishing the binary classes.
nb_rocr <- prediction(predictions = predict(model_nb, newdata = test, type = "raw")[, 2],
labels = test$y,
label.ordering = c("no", "yes"))
plot(performance(nb_rocr,
measure = "tpr",
x.measure = "fpr"))
# AUC
performance(nb_rocr, "auc")@y.values
## [[1]]
## [1] 0.8194124
A limitation of the Naive Bayes classifier is that we can only tune its cutoff. Doing so might increase one metric in the confusion matrix at the cost of decreasing another, while the AUC (which measures how good the model is at distinguishing classes) will not change. Let's see the accuracy, recall, specificity, and precision for different cutoffs.
pred_prob_nb <- predict(model_nb, newdata = test, type = "raw")
metrics <- function(cutoff){
prediction <- as.factor(ifelse(pred_prob_nb[, 2] > cutoff, "yes", "no"))
conf <- confusionMatrix(prediction,
reference = test$y,
positive = "yes")
res <- c(conf$overall[1], conf$byClass[1], conf$byClass[2], conf$byClass[3])
return(res)
}
cutoffs <- seq(0.01, 0.99, length = 99)
result <- matrix(nrow = 99, ncol = 4)
for(i in 1:99){
result[i, ] <- metrics(cutoffs[i])
}
result <- as.data.frame(result) %>%
rename(Accuracy = V1,
Recall = V2,
Specificity = V3,
Precision = V4) %>%
mutate(Cutoff = cutoffs)
result %>%
gather(key = "metrics", value = "value", -Cutoff) %>%
ggplot(mapping = aes(x = Cutoff,
y = value,
col = metrics)) +
geom_line(lwd = 1) +
labs(title = "Metrics for Different Cutoffs",
y = "Value") +
theme_minimal() +
theme(legend.position = "top",
legend.title = element_blank(),
plot.title = element_text(hjust = 0.5))
Again, feel free to decide which cutoff is optimal. I personally think that focusing on precision is not worthwhile here, since the peak precision is below 50%. So I would focus on the other three metrics, or simply use another classifier.
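If a single cutoff is needed, one common heuristic (my own suggestion, not part of the original analysis) is to pick the cutoff that maximizes Youden's J statistic, i.e. Recall + Specificity - 1, computed from the result data frame above:
# One possible heuristic: the cutoff maximizing Youden's J (Recall + Specificity - 1)
result %>%
  mutate(J = Recall + Specificity - 1) %>%
  slice_max(J, n = 1)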
The decision tree is probably one of the most popular machine learning methods for classification tasks. Unlike Naive Bayes, a decision tree produces its output as a set of rules, which means it can easily be understood by humans and applied to many scenarios. Let's first build a decision tree classifier with the default options.
model_dt <- ctree(y ~ ., data = train_down)
model_dt
##
## Model formula:
## y ~ age + job + marital + education + balance + housing + loan +
## duration + campaign + previous
##
## Fitted party:
## [1] root
## | [2] duration <= 251
## | | [3] duration <= 125
## | | | [4] duration <= 77
## | | | | [5] education in primary, secondary: no (n = 537, err = 1.9%)
## | | | | [6] education in tertiary: no (n = 243, err = 6.2%)
## | | | [7] duration > 77
## | | | | [8] housing in no
## | | | | | [9] campaign <= 1: no (n = 183, err = 39.3%)
## | | | | | [10] campaign > 1
## | | | | | | [11] job in admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, technician, unemployed
## | | | | | | | [12] previous <= 0: no (n = 256, err = 11.7%)
## | | | | | | | [13] previous > 0: no (n = 33, err = 45.5%)
## | | | | | | [14] job in student: yes (n = 10, err = 20.0%)
## | | | | [15] housing in yes
## | | | | | [16] education in primary, secondary
## | | | | | | [17] previous <= 1: no (n = 254, err = 2.8%)
## | | | | | | [18] previous > 1: no (n = 54, err = 18.5%)
## | | | | | [19] education in tertiary: no (n = 117, err = 17.9%)
## | | [20] duration > 125
## | | | [21] housing in no
## | | | | [22] previous <= 0
## | | | | | [23] loan in no
## | | | | | | [24] campaign <= 1
## | | | | | | | [25] job in admin., management, retired, services, student: yes (n = 202, err = 33.7%)
## | | | | | | | [26] job in blue-collar, entrepreneur, housemaid, self-employed, technician, unemployed: no (n = 121, err = 38.8%)
## | | | | | | [27] campaign > 1
## | | | | | | | [28] duration <= 205
## | | | | | | | | [29] campaign <= 5
## | | | | | | | | | [30] balance <= 154: no (n = 69, err = 18.8%)
## | | | | | | | | | [31] balance > 154: no (n = 163, err = 41.1%)
## | | | | | | | | [32] campaign > 5: no (n = 46, err = 6.5%)
## | | | | | | | [33] duration > 205: no (n = 129, err = 48.8%)
## | | | | | [34] loan in yes
## | | | | | | [35] marital in divorced, married: no (n = 90, err = 6.7%)
## | | | | | | [36] marital in single: no (n = 30, err = 30.0%)
## | | | | [37] previous > 0
## | | | | | [38] duration <= 166: yes (n = 123, err = 31.7%)
## | | | | | [39] duration > 166: yes (n = 257, err = 9.3%)
## | | | [40] housing in yes
## | | | | [41] previous <= 0
## | | | | | [42] job in admin., housemaid, management, retired
## | | | | | | [43] balance <= 2679: no (n = 231, err = 16.5%)
## | | | | | | [44] balance > 2679
## | | | | | | | [45] duration <= 167: no (n = 20, err = 20.0%)
## | | | | | | | [46] duration > 167: yes (n = 15, err = 20.0%)
## | | | | | [47] job in blue-collar, entrepreneur, self-employed, services, student, technician, unemployed: no (n = 472, err = 6.8%)
## | | | | [48] previous > 0
## | | | | | [49] education in primary, secondary: no (n = 196, err = 34.7%)
## | | | | | [50] education in tertiary
## | | | | | | [51] duration <= 169: no (n = 33, err = 39.4%)
## | | | | | | [52] duration > 169: yes (n = 53, err = 24.5%)
## | [53] duration > 251
## | | [54] duration <= 490
## | | | [55] housing in no
## | | | | [56] loan in no
## | | | | | [57] previous <= 0
## | | | | | | [58] duration <= 441
## | | | | | | | [59] marital in divorced, married
## | | | | | | | | [60] age <= 59
## | | | | | | | | | [61] education in primary: no (n = 29, err = 24.1%)
## | | | | | | | | | [62] education in secondary, tertiary: yes (n = 279, err = 46.2%)
## | | | | | | | | [63] age > 59: yes (n = 82, err = 9.8%)
## | | | | | | | [64] marital in single
## | | | | | | | | [65] age <= 30: yes (n = 98, err = 13.3%)
## | | | | | | | | [66] age > 30: yes (n = 88, err = 40.9%)
## | | | | | | [67] duration > 441: yes (n = 96, err = 15.6%)
## | | | | | [68] previous > 0: yes (n = 415, err = 10.6%)
## | | | | [69] loan in yes
## | | | | | [70] previous <= 0: no (n = 77, err = 22.1%)
## | | | | | [71] previous > 0: yes (n = 9, err = 11.1%)
## | | | [72] housing in yes
## | | | | [73] previous <= 0
## | | | | | [74] duration <= 376
## | | | | | | [75] education in primary, secondary: no (n = 246, err = 15.4%)
## | | | | | | [76] education in tertiary: no (n = 99, err = 33.3%)
## | | | | | [77] duration > 376: no (n = 220, err = 41.4%)
## | | | | [78] previous > 0
## | | | | | [79] loan in no: yes (n = 224, err = 30.8%)
## | | | | | [80] loan in yes: no (n = 34, err = 44.1%)
## | | [81] duration > 490
## | | | [82] duration <= 706
## | | | | [83] housing in no
## | | | | | [84] loan in no: yes (n = 411, err = 14.4%)
## | | | | | [85] loan in yes: yes (n = 43, err = 34.9%)
## | | | | [86] housing in yes
## | | | | | [87] marital in divorced, married: yes (n = 302, err = 33.1%)
## | | | | | [88] marital in single: yes (n = 145, err = 17.2%)
## | | | [89] duration > 706: yes (n = 1198, err = 9.6%)
##
## Number of inner nodes: 44
## Number of terminal nodes: 45
plot(model_dt, type = "simple")
It seems that the decision tree model is too complex. This is understandable, since our data has many features. Let's see the accuracy on both the training and test sets.
# Confusion matrix for training set
confusionMatrix(data = predict(model_dt, newdata = train_down, type = "response"),
reference = train_down$y,
positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 3238 744
## yes 778 3272
##
## Accuracy : 0.8105
## 95% CI : (0.8018, 0.819)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.621
##
## Mcnemar's Test P-Value : 0.3976
##
## Sensitivity : 0.8147
## Specificity : 0.8063
## Pos Pred Value : 0.8079
## Neg Pred Value : 0.8132
## Prevalence : 0.5000
## Detection Rate : 0.4074
## Detection Prevalence : 0.5042
## Balanced Accuracy : 0.8105
##
## 'Positive' Class : yes
##
# Confusion matrix for test set
conf_dt <- confusionMatrix(data = predict(model_dt, newdata = test, type = "response"),
reference = test$y,
positive = "yes")
conf_dt
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6033 204
## yes 1602 801
##
## Accuracy : 0.791
## 95% CI : (0.7822, 0.7995)
## No Information Rate : 0.8837
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3661
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.79701
## Specificity : 0.79018
## Pos Pred Value : 0.33333
## Neg Pred Value : 0.96729
## Prevalence : 0.11632
## Detection Rate : 0.09271
## Detection Prevalence : 0.27813
## Balanced Accuracy : 0.79360
##
## 'Positive' Class : yes
##
Looking at the difference between the training and test sets, I do not think the model is overfitting. Comparing the confusion matrices of the decision tree and Naive Bayes models, we can see that the decision tree, despite slightly lower accuracy, may still be better than Naive Bayes thanks to its higher recall together with decent specificity. If we tuned the Naive Bayes cutoff to reach a similar accuracy and specificity, its sensitivity would still be noticeably lower than the decision tree's.
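That claim can be sanity-checked against the cutoff table from the Naive Bayes section; a quick sketch that inspects Naive Bayes recall at cutoffs whose specificity is close to the decision tree's (~0.79):
# Naive Bayes recall at cutoffs whose specificity is near the decision tree's ~0.79
result %>%
  filter(abs(Specificity - 0.79) < 0.02) %>%
  select(Cutoff, Accuracy, Recall, Specificity)
Now, let's see the ROC curve and AUC of the decision tree model.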
dt_rocr <- prediction(predictions = predict(model_dt, newdata = test, type = "prob")[, 2],
labels = test$y,
label.ordering = c("no", "yes"))
plot(performance(dt_rocr,
measure = "tpr",
x.measure = "fpr"))
# AUC
performance(dt_rocr, "auc")@y.values
## [[1]]
## [1] 0.8629822
As we can see, the decision tree model has an AUC of ~0.863, which means it is better than Naive Bayes at distinguishing the two classes. The decision tree could be tuned by setting mincriterion, minsplit, and minbucket in its control parameters. However, since searching for the best combination of those three would take many iterations, I will skip it for the time being; besides, the model shows no sign of overfitting and has a decent AUC. For now, let's proceed to the last classifier, the random forest.
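For reference, a minimal sketch of what that tuning could look like (my own illustration: the grid values are arbitrary, and selecting on the test set leaks information, so a separate validation set or cross-validation would be more rigorous):
# Hypothetical tuning sketch: evaluate a small grid of ctree_control()
# settings and keep the combination with the highest AUC
grid <- expand.grid(mincriterion = c(0.90, 0.95, 0.99),
                    minsplit = c(20, 50, 100),
                    minbucket = c(7, 20, 50))
aucs <- apply(grid, 1, function(p) {
  m <- ctree(y ~ ., data = train_down,
             control = ctree_control(mincriterion = p["mincriterion"],
                                     minsplit = as.integer(p["minsplit"]),
                                     minbucket = as.integer(p["minbucket"])))
  pr <- prediction(predict(m, newdata = test, type = "prob")[, 2],
                   labels = test$y, label.ordering = c("no", "yes"))
  performance(pr, "auc")@y.values[[1]]
})
grid[which.max(aucs), ]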
Random forest is an ensemble learning method known for its versatility and performance. A random forest consists of many decision trees, each constructed from a different bootstrap sample of the observations (and a random subset of features at each split), so the trees have different characteristics. The random forest prediction aggregates the predictions of all these trees; for classification, it takes a majority vote. First, let's set the training control to use 5-fold cross-validation repeated 3 times.
# The model was made using this code
#set.seed(100)
#ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
#model_rf <- train(y ~., data = train_down, method = "rf", trControl = ctrl)
model_rf <- readRDS("model_rf.rds")
model_rf
## Random Forest
##
## 8032 samples
## 10 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 6426, 6426, 6425, 6426, 6425, 6425, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7991356 0.5982723
## 12 0.8056515 0.6113023
## 23 0.8028712 0.6057415
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 12.
Checking the final random forest model.
model_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 12
##
## OOB estimate of error rate: 19.4%
## Confusion matrix:
## no yes class.error
## no 3162 854 0.2126494
## yes 704 3312 0.1752988
plot(model_rf)
The random forest has an out-of-bag (OOB) error rate of 19.4%. Since OOB error is computed on the observations left out of each tree's bootstrap sample, it serves as an estimate of out-of-sample error. Let's look at the confusion matrices for both the training and test sets.
confusionMatrix(data = predict(model_rf, newdata = train_down),
reference = train_down$y,
positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 4016 0
## yes 0 4016
##
## Accuracy : 1
## 95% CI : (0.9995, 1)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0
## Specificity : 1.0
## Pos Pred Value : 1.0
## Neg Pred Value : 1.0
## Prevalence : 0.5
## Detection Rate : 0.5
## Detection Prevalence : 0.5
## Balanced Accuracy : 1.0
##
## 'Positive' Class : yes
##
conf_rf <- confusionMatrix(data = predict(model_rf, newdata = test),
reference = test$y,
positive = "yes")
conf_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6000 175
## yes 1635 830
##
## Accuracy : 0.7905
## 95% CI : (0.7818, 0.799)
## No Information Rate : 0.8837
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3751
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.82587
## Specificity : 0.78585
## Pos Pred Value : 0.33671
## Neg Pred Value : 0.97166
## Prevalence : 0.11632
## Detection Rate : 0.09606
## Detection Prevalence : 0.28530
## Balanced Accuracy : 0.80586
##
## 'Positive' Class : yes
##
It turns out that the random forest classifier is overfitting: the metrics on the training set are perfect, but on the test set the accuracy is only about 79%, with decent recall and specificity but poor precision. Let's see the ROC and AUC.
rf_rocr <- prediction(predictions = predict(model_rf, newdata = test, type = "prob")$yes,
labels = test$y,
label.ordering = c("no", "yes"))
plot(performance(rf_rocr,
measure = "tpr",
x.measure = "fpr"))
# AUC
performance(rf_rocr, "auc")@y.values
## [[1]]
## [1] 0.8737758
The random forest model's ability to distinguish classes is slightly higher than the decision tree's, with an AUC value of 0.8738.
varImp(model_rf)
## rf variable importance
##
## only 20 most important variables shown (out of 23)
##
## Overall
## duration 100.0000
## balance 32.9411
## age 27.6447
## previous 16.6702
## housingyes 11.2165
## campaign 10.2486
## loanyes 3.9804
## educationtertiary 3.2654
## maritalmarried 3.0600
## jobblue-collar 2.8713
## maritalsingle 2.7803
## jobmanagement 2.6442
## educationsecondary 2.6436
## jobtechnician 2.6404
## jobservices 1.6528
## jobunemployed 1.1723
## jobself-employed 1.0584
## jobentrepreneur 1.0052
## jobstudent 0.9884
## jobretired 0.9608
Using the variable importance, we can also see that the most influential variable is duration (last contact duration, in seconds), followed by balance and age. This seems reasonable, since a longer contact duration might imply that the client is interested in the offer.
Finally, here is a data frame comparing the performance of the Naive Bayes, Decision Tree, and Random Forest models.
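All the needed numbers live in the objects created above; here is a sketch of how such a comparison could be assembled from conf_nb, conf_dt, conf_rf, and the ROCR prediction objects:
# Assemble a comparison table from the confusion matrices and AUC values above
model_comparison <- data.frame(
  Model = c("Naive Bayes", "Decision Tree", "Random Forest"),
  Accuracy = c(conf_nb$overall["Accuracy"],
               conf_dt$overall["Accuracy"],
               conf_rf$overall["Accuracy"]),
  Recall = c(conf_nb$byClass["Sensitivity"],
             conf_dt$byClass["Sensitivity"],
             conf_rf$byClass["Sensitivity"]),
  Specificity = c(conf_nb$byClass["Specificity"],
                  conf_dt$byClass["Specificity"],
                  conf_rf$byClass["Specificity"]),
  Precision = c(conf_nb$byClass["Pos Pred Value"],
                conf_dt$byClass["Pos Pred Value"],
                conf_rf$byClass["Pos Pred Value"]),
  AUC = c(performance(nb_rocr, "auc")@y.values[[1]],
          performance(dt_rocr, "auc")@y.values[[1]],
          performance(rf_rocr, "auc")@y.values[[1]])
)
model_comparison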
Looking at the data frame, we can conclude that the Naive Bayes classifier, despite having the highest accuracy, is the worst at distinguishing the classes (although the difference is not large). The random forest's performance is similar to the decision tree's; it distinguishes whether a bank client will subscribe to a term deposit slightly better than the decision tree does. However, considering how interpretable and adaptable the decision tree is, it may be the better choice over the random forest classifier. From the variable importance of the random forest model, the most influential feature for predicting whether a customer subscribes to a term deposit is the contact duration: a longer contact duration might imply that the client is interested in the bank's offers.