Banking Deposit Investment Classification Using Naive Bayes, Decision Tree and Random Forest
Introduction
There has been a revenue decline at the Portuguese bank, and the bank would like to know what actions to take. After investigation, they found that the root cause was that their customers were not investing enough in long-term deposits. The bank would therefore like to identify existing customers with a higher chance of subscribing to a long-term deposit and focus its marketing efforts on them.
Previously, we answered the bank’s problem using Logistic Regression and K-Nearest Neighbor. If you want to check that report, click here.
Now, in this report, we are going to answer the bank’s problem using Naive Bayes, Decision Tree and Random Forest. The process includes Data Preparation, Exploratory Data Analysis, Data Pre-Processing, Model Building & Prediction, Model Evaluation and Conclusion.
Data Preparation
# Library Input
library(dplyr)
library(ggplot2)
library(GGally)
library(car)
library(caret)
library(partykit)
library(rsample)
library(e1071)
library(randomForest)
library(ROCR)
library(rpart)
library(rpart.plot)
# Data Input
bank <- read.csv("data/bank_investments.csv", stringsAsFactors = T)
head(bank)
## age job marital education default housing loan contact
## 1 49 blue-collar married basic.9y unknown no no cellular
## 2 37 entrepreneur married university.degree no no no telephone
## 3 78 retired married basic.4y no no no cellular
## 4 36 admin. married university.degree no yes no telephone
## 5 59 retired divorced university.degree no no no cellular
## 6 29 admin. single university.degree no no no cellular
## month day_of_week duration campaign pdays previous poutcome y
## 1 nov wed 227 4 999 0 nonexistent no
## 2 nov wed 202 2 999 1 failure no
## 3 jul mon 1148 1 999 0 nonexistent yes
## 4 may mon 120 2 999 0 nonexistent no
## 5 jun tue 368 2 999 0 nonexistent no
## 6 aug wed 256 2 999 0 nonexistent no
The data used in this report contains 32,950 rows and 16 variables including the target feature, ordered by date (from May 2008 to November 2010). The data set consists of the following variables:
- age: the age of the client
- job: type of job
- marital: marital status
- education: the client’s last education
- default: whether or not the client has credit in default
- housing: does the client have a housing loan?
- loan: does the client have a personal loan?
- contact: contact communication type
- month: last contact month of the year
- day_of_week: last contact day of the week
- duration: last contact duration, in seconds
- campaign: number of contacts performed during this campaign and for this client (includes the last contact)
- pdays: number of days that passed after the client was last contacted in a previous campaign (999 means the client was not previously contacted)
- previous: number of contacts performed before this campaign and for this client
- poutcome: outcome of the previous marketing campaign
Target Variable:
y: has the client subscribed to a term deposit? (yes/no)
# Checking Missing Values
colSums(is.na(bank))
## age job marital education default housing
## 0 0 0 0 0 0
## loan contact month day_of_week duration campaign
## 0 0 0 0 0 0
## pdays previous poutcome y
## 0 0 0 0
All columns have been imported with the desired data types (stringsAsFactors = T converted the character columns to factors), and there are no missing values.
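As a quick sanity check (a minimal sketch), str() confirms the imported types:
# Checking Data Structure
str(bank)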
Exploratory Data Analysis
Exploratory data analysis is the phase where we explore the data variables and look for any patterns that may indicate correlation between them.
# Data Summary
summary(bank)
## age job marital education
## Min. :17.00 admin. :8314 divorced: 3675 university.degree :9736
## 1st Qu.:32.00 blue-collar:7441 married :19953 high.school :7596
## Median :38.00 technician :5400 single : 9257 basic.9y :4826
## Mean :40.01 services :3196 unknown : 65 professional.course:4192
## 3rd Qu.:47.00 management :2345 basic.4y :3322
## Max. :98.00 retired :1366 basic.6y :1865
## (Other) :4888 (Other) :1413
## default housing loan contact
## no :26007 no :14900 no :27131 cellular :20908
## unknown: 6940 unknown: 796 unknown: 796 telephone:12042
## yes : 3 yes :17254 yes : 5023
##
##
##
##
## month day_of_week duration campaign pdays
## may :11011 fri:6322 Min. : 0.0 Min. : 1.000 Min. : 0.0
## jul : 5763 mon:6812 1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.:999.0
## aug : 4948 thu:6857 Median : 180.0 Median : 2.000 Median :999.0
## jun : 4247 tue:6444 Mean : 258.1 Mean : 2.561 Mean :962.1
## nov : 3266 wed:6515 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.:999.0
## apr : 2085 Max. :4918.0 Max. :56.000 Max. :999.0
## (Other): 1630
## previous poutcome y
## Min. :0.0000 failure : 3429 no :29238
## 1st Qu.:0.0000 nonexistent:28416 yes: 3712
## Median :0.0000 success : 1105
## Mean :0.1747
## 3rd Qu.:0.0000
## Max. :7.0000
##
Note that the numeric predictors are on very different scales (duration and pdays, for example, span far wider ranges than campaign or previous). The models used in this report are not sensitive to feature scale, but for scale-sensitive algorithms further scaling might be needed.
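If a scale-sensitive model were added later, a minimal sketch using caret’s preProcess (purely illustrative, not applied anywhere in this report) could look like this:
# Scaling sketch (illustration only, not used in this report)
scaler <- preProcess(bank, method = c("center", "scale"))  # transforms numeric columns only
bank_scaled <- predict(scaler, newdata = bank)             # factor columns pass through unchanged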
# Correlation of Numeric Variables
ggcorr(bank, hjust = 1, layout.exp = 2, label = T, label_size = 4,
       low = "#d53e4f", mid = "white", high = "#fddbc7")
Based on the correlation plot, there is no correlation between the numeric predictors except for age and previous, which have a negative correlation. Since Naive Bayes assumes each predictor is independent of the others, it is important that the predictors show little to no correlation with each other.
# Target Variable Proportion
investment_y <- bank %>%
select(y) %>%
group_by(y) %>%
summarise(count = length(y))
ggplot(data = investment_y, mapping = aes(x = reorder(y, -count), y = count)) +
geom_col(mapping = aes(fill = y),
position = "dodge") +
labs(title = "Customer Investment Proportion",
     fill = "Investment",
     x = NULL,
     y = "Number of Customers") +
scale_fill_manual(values = c("#d53e4f", "#fddbc7")) +
theme_minimal()
As can be seen above, the target variable is clearly imbalanced. Further resampling will be needed.
Data Pre-Processing
In this process, we will split the data into a train data set and a test data set. The train data set will be used to build the models, while the test data set will be used to evaluate predictions of the target variable. We will take 80% of the data as the train data set and use the rest as the test data set.
# Splitting the data set
RNGkind(sample.kind = "Rounding")
set.seed(123)
index_bank <- sample(x = nrow(bank) , size = nrow(bank)*0.8)
bank_train <- bank[index_bank, ]
bank_test <- bank[-index_bank, ]
Next, we will check the class proportions to see whether or not the data is imbalanced.
# Checking data proportion
prop.table(table(bank$y))
##
## no yes
## 0.8873445 0.1126555
# Checking data proportion of train data set
prop.table(table(bank_train$y))
##
## no yes
## 0.8876328 0.1123672
# Checking data proportion of test data set
prop.table(table(bank_test$y))
##
## no yes
## 0.8861912 0.1138088
Since the class distribution of the target variable is around 89 : 11, indicating an imbalanced dataset, we need to resample it.
# Downsampling
RNGkind(sample.kind = "Rounding")
set.seed(123)
bank_train_down <- downSample(x = bank_train %>% select(-y),
y = bank_train$y,
yname = "y")
prop.table(table(bank_train_down$y))
##
## no yes
## 0.5 0.5
A balanced class proportion in the train data is important because the model will be trained on it; with the original 89 : 11 split, the model would mostly learn the majority class.
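Note that downsampling discards a large part of the majority class. caret’s upSample() is the mirror-image alternative that duplicates minority-class rows instead; a minimal sketch, not used further in this report:
# Upsampling alternative (sketch): keep all majority rows, duplicate minority rows
bank_train_up <- upSample(x = bank_train %>% select(-y),
                          y = bank_train$y,
                          yname = "y")
prop.table(table(bank_train_up$y)) # should also give 0.5 / 0.5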
Model Building & Prediction
Naive Bayes
# Model Building (all variables)
model_nb <- naiveBayes(formula = y ~ . , data = bank_train_down, laplace = 1)
Naive Bayes has a characteristic called skewness due to scarcity: when a predictor level has a frequency of 0 for one of the classes, the model automatically predicts a probability of 0 for that condition, regardless of the values of the other predictors. This biases the model and makes it less accurate. To avoid this, we apply Laplace Smoothing, which adds a certain number (usually 1) to the frequency of every predictor level, so that no level is left with a frequency of 0.
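To illustrate the effect on a hypothetical frequency table (toy numbers, not taken from the data), compare the conditional probabilities with and without smoothing:
# Toy illustration of Laplace smoothing (hypothetical counts for one predictor)
freq <- c(a = 0, b = 7, c = 3)      # level "a" never observed for this class
freq / sum(freq)                    # laplace = 0: P(a | class) = 0
(freq + 1) / sum(freq + 1)          # laplace = 1: P(a | class) = 1/13 > 0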
# Model Output
model_nb
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## no yes
## 0.5 0.5
##
## Conditional probabilities:
## age
## Y [,1] [,2]
## no 39.93991 9.807662
## yes 40.95307 13.877065
##
## job
## Y admin. blue-collar entrepreneur housemaid management retired
## no 0.248823134 0.228312038 0.041694687 0.024882313 0.073974445 0.033624748
## yes 0.291526564 0.133826496 0.028917283 0.022528581 0.070948218 0.096503026
## job
## Y self-employed services student technician unemployed unknown
## no 0.029253531 0.106590451 0.017148621 0.164761264 0.023537323 0.007397445
## yes 0.034969738 0.069939475 0.057162071 0.156018830 0.029589778 0.008069939
##
## marital
## Y divorced married single unknown
## no 0.118341200 0.611260958 0.269049225 0.001348618
## yes 0.097774781 0.544504383 0.354012138 0.003708699
##
## education
## Y basic.4y basic.6y basic.9y high.school illiterate
## no 0.0966329966 0.0579124579 0.1481481481 0.2414141414 0.0003367003
## yes 0.0966329966 0.0397306397 0.0983164983 0.2205387205 0.0013468013
## education
## Y professional.course university.degree unknown
## no 0.1276094276 0.2851851852 0.0427609428
## yes 0.1262626263 0.3643097643 0.0528619529
##
## default
## Y no unknown yes
## no 0.7801011804 0.2192242833 0.0006745363
## yes 0.9045531197 0.0951096121 0.0003372681
##
## housing
## Y no unknown yes
## no 0.44485666 0.02293423 0.53220911
## yes 0.43912310 0.02327150 0.53760540
##
## loan
## Y no unknown yes
## no 0.82563238 0.02293423 0.15143339
## yes 0.82462057 0.02327150 0.15210793
##
## contact
## Y cellular telephone
## no 0.5843455 0.4156545
## yes 0.8275978 0.1724022
##
## month
## Y apr aug dec jul jun mar
## no 0.051480485 0.157469717 0.001682369 0.163526245 0.146029610 0.010094213
## yes 0.108344549 0.141655451 0.020188425 0.148048452 0.121130552 0.059555855
## month
## Y may nov oct sep
## no 0.347913863 0.095558546 0.014131898 0.012113055
## yes 0.187079408 0.088829071 0.070995962 0.054172275
##
## day_of_week
## Y fri mon thu tue wed
## no 0.1985170 0.2200876 0.1981800 0.1921132 0.1911021
## yes 0.1840243 0.1877317 0.2214358 0.2025615 0.2042467
##
## duration
## Y [,1] [,2]
## no 219.3518 197.3430
## yes 551.2154 402.9288
##
## campaign
## Y [,1] [,2]
## no 2.555706 2.793423
## yes 2.039838 1.657146
##
## pdays
## Y [,1] [,2]
## no 980.2346 135.2032
## yes 781.7677 410.5785
##
## previous
## Y [,1] [,2]
## no 0.1390952 0.4257392
## yes 0.5104659 0.8790741
##
## poutcome
## Y failure nonexistent success
## no 0.10118044 0.88229342 0.01652614
## yes 0.13086003 0.66779089 0.20134907
# Predict Data Test
pred_nb <- predict(object = model_nb, newdata = bank_test, type = "class")
table(predict = pred_nb,
actual = bank_test$y)
## actual
## predict no yes
## no 5336 317
## yes 504 433
Decision Tree
# Model Building (all variables)
model_dt <- rpart(formula = y ~ ., data = bank_train_down, method = "class")
model_dt
## n= 5924
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 5924 2962 no (0.50000000 0.50000000)
## 2) duration< 206.5 2343 495 no (0.78873239 0.21126761)
## 4) pdays>=512.5 2149 335 no (0.84411354 0.15588646)
## 8) month=aug,jul,jun,may,nov 1814 156 no (0.91400221 0.08599779) *
## 9) month=apr,dec,mar,oct,sep 335 156 yes (0.46567164 0.53432836)
## 18) duration< 95.5 90 13 no (0.85555556 0.14444444) *
## 19) duration>=95.5 245 79 yes (0.32244898 0.67755102) *
## 5) pdays< 512.5 194 34 yes (0.17525773 0.82474227) *
## 3) duration>=206.5 3581 1114 yes (0.31108629 0.68891371)
## 6) duration< 457.5 1847 845 yes (0.45749865 0.54250135)
## 12) pdays>=513 1498 669 no (0.55340454 0.44659546)
## 24) month=aug,jul,jun,may,nov 1151 383 no (0.66724587 0.33275413) *
## 25) month=apr,dec,mar,oct,sep 347 61 yes (0.17579251 0.82420749) *
## 13) pdays< 513 349 16 yes (0.04584527 0.95415473) *
## 7) duration>=457.5 1734 269 yes (0.15513264 0.84486736) *
# Decision Tree Plot
rpart.plot(model_dt, type = 2, nn = TRUE, extra = "auto",
           box.palette = c("#d53e4f", "#fddbc7"), shadow.col = "grey")
Based on the decision tree plot above, we can see the number of divisions/leaves (the width) and the number of layers/levels (the depth).
- At the top of the diagram, [1] is called the Root Node. It is the first branch in determining the target value, and its splitting variable is commonly referred to as the main predictor. In our case, it shows the proportion of customers that accepted the investment offer: 50% of the (downsampled) train set accepted.
- [2], [3], [4], [6], [9], [12] are called Internal Nodes. Internal nodes (branches) have an arrow pointing at them and arrows pointing from them. For example, the root asks whether the duration of the call is less than 207 seconds. If yes, we go down to its left child, node [2] (depth 2): 40% of the calls have a duration below 207, with a 21% probability of an accepted investment offer; and so on.
- [8], [18], [19], [5], [24], [25], [13], and [7] are Leaf Nodes. Leaf nodes have arrows pointing at them, but no arrows pointing from them.
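If the plot becomes hard to read, the same splits can also be printed as plain if-then rules with rpart.plot’s rpart.rules() function, a quick sketch:
# Print the fitted tree as readable decision rules
rpart.rules(model_dt, roundint = FALSE)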
# Predict Data Test
pred_dt <- predict(object = model_dt, newdata = bank_test, type = "class")
table(predict = pred_dt,
actual = bank_test$y)
## actual
## predict no yes
## no 4871 155
## yes 969 595
Strictly speaking, predictions only need to be run on the test data set. But we will also predict on the train data set to check whether or not our model is over-fitted.
# Predict Data Train
pred_dt_train <- predict(object = model_dt, newdata = bank_train_down, type = "class")
table(predict = pred_dt_train,
actual = bank_train_down$y)
## actual
## predict no yes
## no 2503 552
## yes 459 2410
Random Forest
Before building the Random Forest model, we will delete the columns whose variance is close to zero. A disadvantage of Random Forest is its very heavy computational load, which can be reduced by limiting the number of predictors. When a data set has a large number of columns, we can delete those with near-zero variance, since they are less informative.
# Feature Selection using nearZeroVar
no_var_train <- nearZeroVar(bank_train_down)
# Drop the same near-zero-variance columns from both sets so their columns stay aligned
bank_train_novar <- bank_train_down[, -no_var_train]
bank_test_novar <- bank_test[, -no_var_train]
head(bank_train_novar)
## age job marital education default housing loan contact
## 1 47 retired single basic.6y unknown no no cellular
## 2 60 admin. single high.school no yes no telephone
## 3 39 blue-collar married high.school unknown no no telephone
## 4 61 technician married professional.course no no no cellular
## 5 31 blue-collar married basic.6y unknown yes no telephone
## 6 41 technician married university.degree no yes no telephone
## month day_of_week duration campaign previous poutcome y
## 1 jul mon 157 5 0 nonexistent no
## 2 may thu 609 1 0 nonexistent no
## 3 may wed 278 1 0 nonexistent no
## 4 oct thu 76 1 0 nonexistent no
## 5 may wed 298 1 0 nonexistent no
## 6 may fri 67 3 0 nonexistent no
In this model, we are going to use a model-evaluation technique called K-fold Cross-Validation. This technique splits the data into k equal-sized groups (folds), uses one fold as validation data while the remaining folds become the train data, and repeats the process k times so that each fold is used for validation once. As an example, we will build a random forest model with k-fold cross-validation (k = 5), where the whole k-fold procedure is repeated 3 times:
# Cross Validation
set.seed(123)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# Model Building
model_rf <- train(y ~ ., data = bank_train_novar, method = "rf", trControl = ctrl)
# Saving Model
saveRDS(model_rf, file = "model_rf.rds")
# Model Summary
rf_model <- readRDS("model_rf.rds")
rf_model
## Random Forest
##
## 5924 samples
## 14 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4740, 4740, 4738, 4739, 4739, 4740, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7910180 0.5820361
## 24 0.8529709 0.7059414
## 47 0.8490881 0.6981762
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 24.
Based on the model summary, the optimal mtry value was 24, with an Accuracy of around 85%.
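By default, caret only tries a few mtry values (here 2, 24 and 47). If we wanted to search a custom grid, a sketch would be (the grid values are illustrative, and re-training is computationally expensive):
# Sketch: tuning mtry over a custom grid (illustrative values)
rf_grid <- expand.grid(mtry = c(2, 8, 16, 24))
model_rf_tuned <- train(y ~ ., data = bank_train_novar, method = "rf",
                        trControl = ctrl, tuneGrid = rf_grid)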
# Predict Data Test
pred_rf <- predict(rf_model, newdata = bank_test)
table(predict = pred_rf,
actual = bank_test$y)
## actual
## predict no yes
## no 4810 86
## yes 1030 664
Model Evaluation
Naive Bayes
# Confusion Matrix
confusionMatrix(data = pred_nb, reference = bank_test$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 5336 317
## yes 504 433
##
## Accuracy : 0.8754
## 95% CI : (0.8672, 0.8833)
## No Information Rate : 0.8862
## P-Value [Acc > NIR] : 0.9969
##
## Kappa : 0.4429
##
## Mcnemar's Test P-Value : 8.502e-11
##
## Sensitivity : 0.57733
## Specificity : 0.91370
## Pos Pred Value : 0.46211
## Neg Pred Value : 0.94392
## Prevalence : 0.11381
## Detection Rate : 0.06571
## Detection Prevalence : 0.14219
## Balanced Accuracy : 0.74552
##
## 'Positive' Class : yes
##
Based on the Confusion Matrix result, this model looks quite good, with an Accuracy of around 87.5%. However, in our case we are offering customers an investment, and the positive class is the customer deciding to invest (yes), so we rely mainly on Recall / Sensitivity (we want to approach/offer as many potential subscribers as we can), which is unfortunately quite low at around 58%.
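One common way to trade Precision for Recall, sketched below with an illustrative (untuned) cutoff of 0.3, is to classify from the raw class probabilities with a threshold lower than the default 0.5:
# Sketch: lowering the classification threshold to favour Recall
# (the 0.3 cutoff is illustrative, not tuned)
prob_nb <- predict(model_nb, newdata = bank_test, type = "raw")
pred_nb_loose <- factor(ifelse(prob_nb[, "yes"] > 0.3, "yes", "no"),
                        levels = levels(bank_test$y))
confusionMatrix(data = pred_nb_loose, reference = bank_test$y, positive = "yes")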
And since our data is imbalanced and we want to see how well our model distinguishes between the Positive and Negative classes, we will use the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve) metrics.
# ROC
pred_nbProb <- predict(model_nb, newdata = bank_test, type = "raw")
df_roc_bank <- data.frame(prob = pred_nbProb[,2],
label = as.numeric(bank_test$y == "yes"))
prediction_roc_bank <- prediction(predictions = df_roc_bank$prob,
labels = df_roc_bank$label)
plot(performance(prediction.obj = prediction_roc_bank,
measure = "tpr", x.measure = "fpr"))
The ROC curve plots the true positive rate (TPR, or Sensitivity) against the false positive rate (FPR, or 1 - Specificity). The closer the curve gets to the upper-left of the plot (a high true positive rate at a low false positive rate), the better the model, so we can conclude that our model has a good measure of separability.
# AUC
auc_bank <- performance(prediction.obj = prediction_roc_bank, measure = "auc")
auc_bank@y.values
## [[1]]
## [1] 0.8683594
AUC shows the area under the ROC curve. The closer it is to 1, the better the model’s performance. Our model’s AUC value is 0.87, which means it’s a good model and has a good measure of separability.
Decision Tree
In this section, we will compare the confusion matrices of the train-set prediction and the test-set prediction. The objective is to check whether or not the model is overfitted (a drawback of the Decision Tree model is its tendency to overfit). If it is, further tree pruning might be needed.
# Confusion Matrix of Data Train Prediction
confusionMatrix(data = pred_dt_train, reference = bank_train_down$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2503 552
## yes 459 2410
##
## Accuracy : 0.8293
## 95% CI : (0.8195, 0.8388)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6587
##
## Mcnemar's Test P-Value : 0.003811
##
## Sensitivity : 0.8136
## Specificity : 0.8450
## Pos Pred Value : 0.8400
## Neg Pred Value : 0.8193
## Prevalence : 0.5000
## Detection Rate : 0.4068
## Detection Prevalence : 0.4843
## Balanced Accuracy : 0.8293
##
## 'Positive' Class : yes
##
# Confusion Matrix of Data Test Prediction
confusionMatrix(data = pred_dt, reference = bank_test$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 4871 155
## yes 969 595
##
## Accuracy : 0.8294
## 95% CI : (0.8201, 0.8384)
## No Information Rate : 0.8862
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4259
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.79333
## Specificity : 0.83408
## Pos Pred Value : 0.38043
## Neg Pred Value : 0.96916
## Prevalence : 0.11381
## Detection Rate : 0.09029
## Detection Prevalence : 0.23733
## Balanced Accuracy : 0.81370
##
## 'Positive' Class : yes
##
Based on the Confusion Matrix results for both data sets, we can conclude that this model is not overfitted: the Accuracy of the train-set prediction and the test-set prediction are both around 83%.
As explained before, in this case we are offering customers the investment, and the positive class is the customer deciding to invest (yes), so we rely mainly on Recall / Sensitivity (we want to approach/offer as many customers as we can).
From the confusion matrices, the Sensitivity of the train-set prediction is around 81% and that of the test-set prediction is around 79%, which means the model is quite good and tree pruning might not be necessary.
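Had the train and test metrics diverged, a minimal pruning sketch using rpart’s complexity parameter table would be:
# Pruning sketch (not needed here): prune back to the cp with the
# lowest cross-validated error
printcp(model_dt)
cp_best <- model_dt$cptable[which.min(model_dt$cptable[, "xerror"]), "CP"]
model_dt_pruned <- prune(model_dt, cp = cp_best)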
But because our data is imbalanced and we want to see how well our model distinguishes between the Positive and Negative classes, we will again use the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve) metrics.
# ROC
pred_dtProb <- predict(object = model_dt, newdata = bank_test, type = "prob")
df_roc_bank_dt <- data.frame(prob = pred_dtProb[,2],
label = as.numeric(bank_test$y == "yes"))
pred_roc_bank_dt <- prediction(predictions = df_roc_bank_dt$prob,
labels = df_roc_bank_dt$label)
plot(performance(prediction.obj = pred_roc_bank_dt,
measure = "tpr", x.measure = "fpr"))
The ROC curve plots the true positive rate (TPR, or Sensitivity) against the false positive rate (FPR, or 1 - Specificity). The closer the curve gets to the upper-left of the plot, the better the model, so we can conclude that this model also has a good measure of separability.
# AUC
auc_bank_dt <- performance(prediction.obj = pred_roc_bank_dt, measure = "auc")
auc_bank_dt@y.values
## [[1]]
## [1] 0.8640068
AUC shows the area under the ROC curve. The closer it is to 1, the better the model’s performance. Our model’s AUC value is 0.86, which means it’s a good model and has a good measure of separability.
Random Forest
When creating a Random Forest model, we are not strictly required to split the data into training and testing sets up front. This is because bootstrap sampling leaves some observations unused when building each tree. These observations are the out-of-bag data and are treated as test data by the model: the model predicts them and calculates the resulting error, referred to as the out-of-bag error.
# Out of Bag Error
rf_model$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 24
##
## OOB estimate of error rate: 14.65%
## Confusion matrix:
## no yes class.error
## no 2447 515 0.1738690
## yes 353 2609 0.1191762
In this model, the Out of Bag Error is 14.65%. In other words, the accuracy of the model on the out-of-bag data is 85.35%.
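As a sanity check, that figure can be recovered directly from the OOB confusion matrix above:
# OOB accuracy = correctly classified OOB rows / total rows
(2447 + 2609) / (2447 + 515 + 353 + 2609)
## [1] 0.8534774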
Even though Random Forest is labeled a non-interpretable model, we can at least see which predictors are used the most (are the most important) in the model:
# Variable Importance
plot(varImp(rf_model), col = "#d53e4f")
From the plot, we can see that duration is the most important variable, followed by age, poutcomesuccess and contacttelephone. Next, we will check the confusion matrix of the prediction we made previously.
# Confusion Matrix of Data Test Prediction
confusionMatrix(data = pred_rf, reference = bank_test$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 4810 86
## yes 1030 664
##
## Accuracy : 0.8307
## 95% CI : (0.8214, 0.8396)
## No Information Rate : 0.8862
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4578
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8853
## Specificity : 0.8236
## Pos Pred Value : 0.3920
## Neg Pred Value : 0.9824
## Prevalence : 0.1138
## Detection Rate : 0.1008
## Detection Prevalence : 0.2571
## Balanced Accuracy : 0.8545
##
## 'Positive' Class : yes
##
Based on the Confusion Matrix result above, the Accuracy of the test-set prediction is around 83%.
As explained before, in this case we are offering customers the investment, and the positive class is the customer deciding to invest (yes), so we rely mainly on Recall / Sensitivity (we want to approach/offer as many customers as we can).
From the confusion matrix, the Sensitivity of the test-set prediction is around 88.5%, which means the model is quite good.
But because our data is imbalanced and we want to see how well our model distinguishes between the Positive and Negative classes, we will again use the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve) metrics.
# ROC
pred_rfProb <- predict(object = rf_model, newdata = bank_test, type = "prob")
df_roc_bank_rf <- data.frame(prob = pred_rfProb[,2],
label = as.numeric(bank_test$y == "yes"))
pred_roc_bank_rf <- prediction(predictions = df_roc_bank_rf$prob,
labels = df_roc_bank_rf$label)
plot(performance(prediction.obj = pred_roc_bank_rf,
measure = "tpr", x.measure = "fpr"))
The ROC curve plots the true positive rate (TPR, or Sensitivity) against the false positive rate (FPR, or 1 - Specificity). The closer the curve gets to the upper-left of the plot, the better the model, so we can conclude that this model has a good measure of separability as well.
# AUC
auc_bank_rf <- performance(prediction.obj = pred_roc_bank_rf, measure = "auc")
auc_bank_rf@y.values
## [[1]]
## [1] 0.9182189
AUC shows the area under the ROC curve. The closer it is to 1, the better the model’s performance. Our model’s AUC value is around 0.92, which means it’s a great model and has a great measure of separability.
Conclusion
After predicting with the three models (Naive Bayes, Decision Tree and Random Forest), there is no dramatic difference in terms of Accuracy: the Naive Bayes model reaches 87.5% Accuracy, while the Decision Tree and Random Forest models both reach around 83%. So based on Accuracy alone, the Naive Bayes model appears to be the best.
However, Accuracy is only the most appropriate metric when the data is balanced, and we know the data set we used is NOT. If we want to see how well each model distinguishes between the Positive and Negative classes, the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve) metrics are needed. Based on the AUC values, the Naive Bayes model scores 0.87, the Decision Tree model 0.86 and the Random Forest model 0.92, so the Random Forest model distinguishes between the Positive and Negative classes the best.
Another factor to consider is that in this case we want to approach/offer as many customers as we can, so we rely on Sensitivity. From that point of view the Random Forest model is also the best: it has around 88.5% Sensitivity, compared to 79% for the Decision Tree model and only 58% for the Naive Bayes model.
Based on those points, we can conclude that the wisest decision for predicting this data set is to use the Random Forest model, since it has the highest AUC and Sensitivity of the three. Based on this model, the most significant variables in predicting a customer’s investment decision are duration, age, poutcomesuccess and contacttelephone.
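For a side-by-side view, the metrics reported in this document can be collected into one comparison table (a sketch; values copied from the evaluation sections above):
# Sketch: model comparison (values from the sections above)
data.frame(Model       = c("Naive Bayes", "Decision Tree", "Random Forest"),
           Accuracy    = c(0.8754, 0.8294, 0.8307),
           Sensitivity = c(0.5773, 0.7933, 0.8853),
           AUC         = c(0.8684, 0.8640, 0.9182))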