Introduction
The objective of a marketing campaign is to promote a product, generate brand awareness and increase sales. In the highly competitive banking business, campaigns are conducted frequently and often require a significant amount of resources and labour. A campaign can be advertised through different types of media or interactive methods, such as video conferences or phone calls, and it usually involves a high budget. According to The Financial Brand, a US bank's marketing budget is approximately 0.08% of its assets; for an average US bank with assets of around USD 2.9 billion, this puts the advertising cost at roughly USD 2.1 million. Therefore, especially today, data has become one of the keys to increasing the efficiency of a marketing campaign: with more data, customers' preferences towards a product can be predicted more accurately.
Objective
This project is part of the "learn by building" classification II section. The objectives are to find which variables are important for obtaining a subscriber and to create a model that can correctly predict a potential client's preference towards the term deposit (product). The models are built using the Naive Bayes, Decision Tree and Random Forest methods.
Dataset Information
The dataset is collected from the UCI Machine Learning Repository. The data relates to direct marketing campaigns of a Portuguese bank, which were conducted through phone calls. We use the "bank-full.csv" dataset, which contains 45,211 rows and 17 variables. Our target variable is y (the status of the client's subscription). The input variables are:
Personal client data :
- Age : age of the client (numeric)
- Job : type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)
- Marital : marital status (categorical: ‘divorced’, ‘married’, ‘single’)
- Education : education level (categorical: ‘secondary’, ‘tertiary’, ‘primary’, ‘unknown’)
- Balance : balance amount (numeric)
- Default: has credit in default? (categorical: ‘no’, ‘yes’)
- Housing: has housing loan? (categorical: ‘no’, ‘yes’)
- Loan: has personal loan? (categorical: ‘no’, ‘yes’)
Related to the last contact of the current campaign :
- Contact: contact communication type (categorical: ‘cellular’,‘telephone’, ‘unknown’)
- Month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
- Day: last contact day of the month (numeric)
- Duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Other attributes :
- Campaign: number of contacts performed during this campaign and for this client (numeric, includes the last contact)
- Pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means the client was not previously contacted)
- Previous: number of contacts performed before this campaign and for this client (numeric)
- Poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘other’, ‘success’, ‘unknown’)
Output variable (desired target) : y - has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’)
Data Preparation
Load Libraries
Load all libraries that are used in this project.
library(caret)
library(e1071)
library(dplyr)
library(ggplot2)
library(wesanderson)
library(GGally)
library(partykit)
library(tidyr)
library(tidymodels)
library(tidyverse)
library(skimr)
library(ROCR)
library(pROC)
library(rpart)
library(rattle)
library(rpart.plot)
library(plotly)
library(randomForest)
library(vip)
Read Data
We use "bank-full.csv" as our dataset. We read it using the function read.csv() and convert character columns into factors by setting stringsAsFactors = T. Then, we assign the result to a new object called bank.
bank <- read.csv("bank-full.csv", sep = ";", stringsAsFactors = T)
Observe Data
Let’s observe the dataset using the function glimpse().
- There are 45,211 rows and 17 columns.
- There is no need to change any of the data types.
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <fct> management, technician, entrepreneur, blue-collar, unknown, …
## $ marital <fct> married, single, married, married, single, married, single, …
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
## $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
## $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
## $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
Missing Value and Summary Dataframe
We can check for missing values and summarize the dataframe by using the functions skim() and partition().
bank %>%
skim() %>%
partition()
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| job | 0 | 1 | FALSE | 12 | blu: 9732, man: 9458, tec: 7597, adm: 5171 |
| marital | 0 | 1 | FALSE | 3 | mar: 27214, sin: 12790, div: 5207 |
| education | 0 | 1 | FALSE | 4 | sec: 23202, ter: 13301, pri: 6851, unk: 1857 |
| default | 0 | 1 | FALSE | 2 | no: 44396, yes: 815 |
| housing | 0 | 1 | FALSE | 2 | yes: 25130, no: 20081 |
| loan | 0 | 1 | FALSE | 2 | no: 37967, yes: 7244 |
| contact | 0 | 1 | FALSE | 3 | cel: 29285, unk: 13020, tel: 2906 |
| month | 0 | 1 | FALSE | 12 | may: 13766, jul: 6895, aug: 6247, jun: 5341 |
| poutcome | 0 | 1 | FALSE | 4 | unk: 36959, fai: 4901, oth: 1840, suc: 1511 |
| y | 0 | 1 | FALSE | 2 | no: 39922, yes: 5289 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 40.94 | 10.62 | 18 | 33 | 39 | 48 | 95 | ▅▇▃▁▁ |
| balance | 0 | 1 | 1362.27 | 3044.77 | -8019 | 72 | 448 | 1428 | 102127 | ▇▁▁▁▁ |
| day | 0 | 1 | 15.81 | 8.32 | 1 | 8 | 16 | 21 | 31 | ▇▆▇▆▆ |
| duration | 0 | 1 | 258.16 | 257.53 | 0 | 103 | 180 | 319 | 4918 | ▇▁▁▁▁ |
| campaign | 0 | 1 | 2.76 | 3.10 | 1 | 1 | 2 | 3 | 63 | ▇▁▁▁▁ |
| pdays | 0 | 1 | 40.20 | 100.13 | -1 | -1 | -1 | -1 | 871 | ▇▁▁▁▁ |
| previous | 0 | 1 | 0.58 | 2.30 | 0 | 0 | 0 | 0 | 275 | ▇▁▁▁▁ |
The output shows there are no missing values in the dataframe. The dataframe contains 10 categorical variables and 7 numeric variables. Among the numerical variables, age, balance, duration, campaign, pdays and previous have right-skewed distributions. On the other hand, it is quite hard to observe the composition of the categorical variables from this summary. We can use barplots to observe the frequency of the categorical variables and histograms to observe the distribution of the numerical variables.
Categorical Variables
The plots below show the frequency of each categorical variable.
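The plotting chunk itself is not echoed in this report. A minimal sketch of how such frequency barplots could be produced with the tidyverse packages loaded above (the fill colour and layout here are illustrative, not the exact chunk used for the figures):
# One possible way to draw frequency barplots for the categorical predictors
bank %>%
  dplyr::select(where(is.factor)) %>%
  dplyr::select(-y) %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(y = value)) +
  geom_bar(fill = "#519259") +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Frequency of Categorical Variables", x = "Count", y = NULL)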
Insight:
- There is a variety of occupations in the dataset. Management and blue-collar jobs together account for roughly 40% of the clients.
- About half of the clients have a secondary education.
- Most of the clients (around 84%) do not have a personal loan.
- Most of the clients are married.
- Roughly 65% of the campaign communication with clients is conducted via cellular phone.
- There is a significant imbalance in the default classes: most of the clients have no credit in default. This is expected, since the bank prefers to do business with customers who have no credit default.
Numerical Variables
The plots below show the distribution of each numerical variable.
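As with the categorical plots, the chunk is not echoed; a comparable sketch for the histograms (again only an approximation of the actual figures) might be:
# One possible way to draw histograms for the numerical predictors
bank %>%
  dplyr::select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "#519259") +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distribution of Numerical Variables", x = NULL, y = "Frequency")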
Insight:
- The majority of the bank’s clients are between 25 and 50 years old, with a median age of 39. Thus, middle-aged clients might be the target customers.
- The balance, pdays, campaign and previous variables have right-skewed distributions, as their values are mostly concentrated at the low end. These variables might have near-zero variance.
Feature Selection
[Near-zero variance means that the] fraction of unique values over the sample size is low (say 10%) […] [and the] ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20). If both of these criteria are true and the model in question is susceptible to this type of predictor, it may be advantageous to remove the variable from the model.
– Kuhn, M., & Johnson, K. (2013). Applied predictive modeling, New York, NY: Springer.
To confirm near-zero variance, we can use the function nearZeroVar().
rmarkdown::paged_table(as.data.frame(nearZeroVar(bank, saveMetrics = T)))
The table shows that pdays and default are indeed near-zero-variance variables, so we can drop them. As mentioned in the dataset information, the duration variable highly affects the target variable, yet the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model. Based on this information, we also drop this variable.
bank <- bank %>%
dplyr::select (-c(pdays, default, duration))
Correlation
The correlation between the numerical variables can be observed by using the function ggcorr().
ggcorr(bank,
label = T,
size = 3, hjust = 0.1, color='black', angle=90,
layout.exp = 3,
cex = 3)+
labs(title = 'Correlation Matrix Predictors')+
theme(plot.title = element_text(size=20),
legend.text = element_text(size = 12))
Insight: There is almost no correlation between the numerical variables. Campaign and day, and balance and age, have weak positive correlations.
Cross Validation
Split Train-Test Data
We split the dataset into train and test sets by using the function sample(). The train dataset contains 80% of the bank dataframe and is used to train the models, while the remaining 20% is used to test them.
# Lock random samples
RNGkind(sample.kind = "Rounding")
set.seed(100)
# Create train dataset with 80% of total row in dataframe and the rest is used to test the model.
insample <- sample(nrow(bank), nrow(bank)*0.8)
# Create bank_train as our train data
bank_train <- bank[insample,]
# Create bank_test as our test data
bank_test <- bank[-insample,]
Let’s check the class proportion of y in the train dataset. The proportion of the target classes is roughly 88:12, which is considered a significant imbalance.
round(prop.table(table(bank_train$y)),2)
##
## no yes
## 0.88 0.12
Balance Target Class
In this section, we will balance the proportion class in the target variable of the train dataset (bank_train). It is crucial to have a balanced class proportion so that the model can predict well in both classes. There are four methods for balancing class proportions:
- Up-sampling: The method increases the number of observations from the minority class to balance the classes.
- Down-sampling: The method reduces the number of observations from the majority class to balance the classes.
- Both sampling: The method is a combination of over and under-sampling.
- Synthetic data generation: This method overcomes imbalances by generating artificial data. It generates a random set of minority class data to shift the classifier’s learning bias towards the minority class.
Since we don’t want to lose information from our dataset, we use up-sampling via the function upSample().
# Lock random samples
RNGkind(sample.kind = "Rounding")
set.seed(100)
# Create a balance target of bank_train and assign as `bank_train_up`
bank_train_up <- upSample(x = bank_train %>% dplyr::select(-y),
y = bank_train$y,
yname= "y")
# Check the proportion of the target class of `bank_train_up`.
table(bank_train_up$y)
##
## no yes
## 31932 31932
The above output shows that the target class in bank_train_up is balanced, with 31,932 samples in each class. We will use this train dataset to create the models.
Naive Bayes
Naive Bayes is a simple “probabilistic classifier” based on Bayes’ theorem. The model assumes independence between the predictor variables. It is easy to use and fast to build, which makes it suitable for real-time prediction, and it can handle both discrete and continuous predictor variables. However, the model suffers when data are scarce, because categories that are unseen in the training data receive zero probability.
Modelling
We create a Naive Bayes model by using the function naiveBayes() with bank_train_up as our data and set laplace = 1 (Laplace smoothing) to avoid zero probabilities for rare categories.
model_naive <- naiveBayes(y~., bank_train_up, laplace = 1)
Prediction
The prediction can be generated by using function predict() with bank_test as our newdata.
pred_naive <- predict(object = model_naive, newdata=bank_test, type="class")
Model Evaluation
ROC and AUC
The empirical ROC curve is a probability curve showing the true positive rate (sensitivity) against the false positive rate (1 − specificity) for all possible cut-off values. The AUC (Area Under the Curve) measures how well the model can distinguish between classes: the higher the AUC, the better the model separates the positive and negative classes.
# Create a probability prediction from `model_naive`
prob_naive <- predict(object = model_naive, newdata=bank_test, type="raw")
# Create a prob and label from prob_naive.
roc_bank <- data.frame(prob=prob_naive[,2],
label=as.numeric(bank_test$y=="yes"))
# Create an object prediction
prediction_roc_bank <- prediction(predictions = roc_bank$prob,
labels = roc_bank$label)
# Create an ROC plot
plot(ROCR::performance(prediction.obj = prediction_roc_bank,
measure = "tpr",
x.measure = "fpr"),main = "ROC Naive Bayes", col="#519259")
abline(a = 0, b = 1)
# Obtain the AUC
auc_naive <- ROCR::performance(prediction.obj=prediction_roc_bank, measure = "auc")
auc_naive@y.values[[1]]
## [1] 0.7515969
Insight: The AUC of the model is 0.7516 (AUC is between 0.5 and 1), implying that the model can distinguish between positive and negative classes.
Confusion Matrix
The confusion matrix is used to describe the performance of a classification algorithm. Four metrics to evaluate classifiers are Accuracy, Sensitivity, Specificity and Precision.
- Accuracy is defined as the ratio of correctly predicted cases by the total cases.
\[ Accuracy = \frac{TP+TN}{TP + TN +FP+FN} \]
- Precision or Pos Pred Value describes how many of the cases predicted as positive actually turned out to be positive. This metric indicates how reliable the model’s positive predictions are.
\[ Precision = \frac{TP}{TP + FP} \]
- Recall or Sensitivity describes how many of the actual positive cases we were able to predict correctly with our model.
\[ Recall = \frac{TP}{TP + FN} \]
- Specificity describes how many of the actual negative cases we are able to predict correctly with our model.
\[ Specificity = \frac{TN}{TN + FP} \]
Insight: In this project, the expected model is one that can correctly predict the customer’s subscription to the term deposit product. The main metric used in this project is accuracy, where the model is good at classifying customers’ preferences based on the given predictors. However, depending on the objective of the bank, the other metrics can also be used. The precision value can be used when the bank wants to improve the effectiveness of the campaign by reducing calling time and labour costs, so that the bank’s staff focus on interested customers instead of uninterested ones. The sensitivity value can be used when the bank wants to obtain as many customers as possible so that it does not lose opportunities or revenue.
Let’s check the summary of the confusion matrix for pred_naive by using the function confusionMatrix(). We define yes as the positive class.
con_naive_test <- confusionMatrix(data= pred_naive, reference = bank_test$y, positive="yes")
con_naive_test
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 5534 332
## yes 2456 721
##
## Accuracy : 0.6917
## 95% CI : (0.6821, 0.7012)
## No Information Rate : 0.8836
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2012
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.68471
## Specificity : 0.69262
## Pos Pred Value : 0.22694
## Neg Pred Value : 0.94340
## Prevalence : 0.11644
## Detection Rate : 0.07973
## Detection Prevalence : 0.35132
## Balanced Accuracy : 0.68866
##
## 'Positive' Class : yes
##
Insights:
- The confusion matrix shows that the model using the Naive Bayes method has an accuracy value of 0.6917. It implies that the model can predict 69.17% of the test data correctly.
- The precision value is 0.2269 (low), implying that only 22.69% of the cases predicted as positive actually turned out to be positive. In other words, the model is poor at classifying the positive predicted cases.
- The sensitivity and specificity values are 68.47% and 69.26%, respectively, meaning roughly two thirds of the actual positive cases and of the actual negative cases are correctly classified (these figures can also be verified by hand, as shown below).
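As a sanity check, the four metrics above can be reproduced by hand from the confusion matrix (with yes as the positive class, so TP = 721, FN = 332, FP = 2456, TN = 5534):
# Manual verification of the confusionMatrix() output above
TP <- 721; FN <- 332; FP <- 2456; TN <- 5534
c(accuracy    = (TP + TN) / (TP + TN + FP + FN),   # 0.6917
  sensitivity = TP / (TP + FN),                    # 0.6847
  specificity = TN / (TN + FP),                    # 0.6926
  precision   = TP / (TP + FP))                    # 0.2269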
Over or Underfitting
Underfitting is a condition where the model is unable to accurately capture the relationship between the input and output variables. This condition generates a high error rate on both the train and test datasets. It mostly happens when the model is too simple, so the model requires more input variables or less regularization. Overfitting is the condition where the model is too complex, generating a low error on the train data but a high error on the test data, and resulting in a model with low bias but high variance.
We will observe our model for under and overfitting conditions. This can be done by comparing the performance metrics of train and test data.
# Prediction for train_data
pred_naive_train <- predict(object=model_naive, newdata=bank_train_up, type="class")
# confusion matrix data train
con_naive_train <- confusionMatrix(pred_naive_train,
bank_train_up$y,
positive="yes")
Insight:
- The accuracy, sensitivity and specificity values on the train and test data are relatively close (optimum), so we can accept those values for this model.
- On the other hand, the precision on the test data is significantly lower than on the train data, which means the model can be considered overfitted in terms of precision.
Performance Trade-Off
We can improve our model by observing the performance trade-off. First, we create a probability prediction from the Naive Bayes model by using type = "raw". Then, we create a dataframe naive_table containing the model prediction (bank_pred), the predicted probability of the negative class (bank_eprob) and the predicted probability of the positive class (bank_pprob).
The performance trade-off plot code follows the code from this link
# Probability prediction of Naive Bayes model.
prob_naive <- predict(object = model_naive, newdata=bank_test, type="raw")
# Create dataframe contains a model prediction (bank_pred), probability prediction of negative class (bank_eprob) and probability of positive class (bank_pprob)
naive_table <- dplyr::select(bank_test, y) %>%
bind_cols(bank_pred = pred_naive) %>%
bind_cols(bank_eprob = round(prob_naive[,1],4)) %>%
bind_cols(bank_pprob = round(prob_naive[,2],4))
Create a helper function and a loop to observe the performance trade-off across cut-off values.
performa <- function(cutoff, prob, ref, postarget, negtarget)
{
predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
conf <- caret::confusionMatrix(predict , ref, positive = postarget)
acc <- conf$overall[1]
rec <- conf$byClass[1]
prec <- conf$byClass[3]
spec <- conf$byClass[2]
mat <- t(as.matrix(c(rec , acc , prec, spec)))
colnames(mat) <- c("recall", "accuracy", "precision", "specificity")
return(mat)
}
co <- seq(0.01,0.80,length=100)
result <- matrix(0,100,4)
for(i in 1:100){
result[i,] = performa(cutoff = co[i],
prob = naive_table$bank_pprob,
ref = naive_table$y,
postarget = "yes",
negtarget = "no")
}
plot_naive <- tibble("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "performa", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = performa)) +
geom_line(lwd = 1.5) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
theme_set(theme_minimal() + theme(text = element_text(family="Arial Narrow"))) +
scale_color_manual(values = wes_palette(4, name = "Darjeeling1", type = "continuous")) +
labs(title = "Performance Trade-Off") +
theme(legend.position = "top",
plot.title = element_text(size= 17, color = 'black', face ='bold'),
axis.title.x = element_text(size=14, color = 'black'),
axis.title.y = element_text(size = 14, color = 'black'),
axis.text.x = element_text(size = 12, color = 'black'),
axis.text.y = element_text(size = 12, color = 'black'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"))
ggplotly(plot_naive)
Insight:
- From the plot, the intersection point between sensitivity and specificity is at a cut-off (probability threshold) of about 0.49, almost at the default threshold (a quick programmatic check is shown below).
- At this point the accuracy is not at its highest, but the metric values are balanced between the two classes. This means that we do not need to change our threshold.
- The plot also suggests a trade-off between sensitivity and specificity: with a threshold below 0.49 we obtain higher sensitivity and lower specificity, and vice versa with a threshold between 0.49 and 0.80.
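The intersection cut-off quoted above can also be located directly from the result matrix built in the loop (column 1 holds recall/sensitivity and column 4 specificity):
# Cut-off at which sensitivity and specificity are closest to each other
co[which.min(abs(result[, 1] - result[, 4]))]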
Decision Tree
According to Wikipedia, the decision tree is one of the most popular machine learning algorithms due to its simplicity and intelligibility. Decision trees can be used for prediction when the target variable is a class (discrete) or a real number (numeric). A classification tree consists of leaves, which represent class labels, and branches, which represent the combinations of predictor values that lead to those class labels. The main idea of the decision tree is, at each split, to select the variable that returns the highest information gain, i.e. the most homogeneous branches.
Modelling and Pruning
Now, we create a decision tree model by using rpart() and we set the cp value to 0, maxdepth to 5, minsplit to 5 and minbucket to 10 so that the tree can grow. A short explanation of each parameter:
- Minsplit: the minimum number of samples that must exist in a node for a split to be attempted.
- Minbucket: the minimum number of samples allowed in any terminal (leaf) node.
- Maxdepth: the maximum depth of the final tree.
- Cp (complexity parameter): the objective of this parameter is to save computing time by pruning off splits that are not worthwhile.
set.seed(100)
# Create a decision tree model
model_dt <- rpart(formula = y ~.,
data = bank_train_up,
method = "class",
control = rpart.control(cp = 0, maxdepth =5,
minsplit = 5, minbucket = 10))
We can visualize the decision tree by using the function rpart.plot()
# Visualize the decision tree
rpart.plot(model_dt, type=1, sub = NULL)
The model is considered a complex tree because it has too many branches and layers, which can result in overfitting on the training data. We can address this by pruning. The objective is to reduce the size of the decision tree by removing parts of the tree that are redundant or unimportant. There are two pruning approaches, pre-pruning and post-pruning. In this project, we focus on post-pruning, which lets the tree grow to fit the training set as well as possible and then prunes it back.
In this section, we try to find the optimum complexity parameter (cp). As mentioned above, this parameter controls the size of the decision tree: if a split does not improve the fit by at least the value of cp, the tree stops growing. We set the original tree to its maximum depth (cp = 0) so that the tree can grow fully. Then, we can find the optimum value of cp by observing the output of the plotcp() and printcp() functions. The output contains the number of splits, the mean cross-validation error (xerror), and the standard deviation of the cross-validation error (xstd).
# Plot relative cross validation error and cp
plotcp(model_dt)
# Observe the cross-validation error, cp and standard deviation
printcp(model_dt)
##
## Classification tree:
## rpart(formula = y ~ ., data = bank_train_up, method = "class",
## control = rpart.control(cp = 0, maxdepth = 5, minsplit = 5,
## minbucket = 10))
##
## Variables actually used in tree construction:
## [1] age balance campaign contact day education housing
## [8] job month poutcome previous
##
## Root node error: 31932/63864 = 0.5
##
## n= 63864
##
## CP nsplit rel error xerror xstd
## 1 1.7180e-01 0 1.00000 1.00902 0.0039569
## 2 1.3494e-01 1 0.82820 0.82820 0.0038982
## 3 1.9745e-02 2 0.69325 0.69260 0.0037655
## 4 6.0128e-03 5 0.62887 0.62996 0.0036762
## 5 2.0669e-03 8 0.60760 0.60616 0.0036372
## 6 3.7580e-04 9 0.60554 0.60388 0.0036334
## 7 3.2882e-04 10 0.60516 0.60309 0.0036320
## 8 6.2633e-05 12 0.60450 0.60294 0.0036318
## 9 3.1317e-05 13 0.60444 0.60303 0.0036319
## 10 2.0878e-05 17 0.60432 0.60287 0.0036316
## 11 0.0000e+00 20 0.60425 0.60306 0.0036320
A common way to find the optimum cp is to use either the tree with the lowest mean cross-validation error (xerror) or the simplest tree whose xerror lies within one standard error (xstd) of that lowest error.
Based on the output above, the best tree is in row 10 (17 splits) because it has the lowest xerror (0.60287). However, the tree in row 5 (8 splits) does effectively the same job, because its xerror (0.60616) is within one standard error of the best tree’s xerror (0.60287 + 0.0036316 = 0.6065016). Based on this, the optimum complexity parameter is 0.002066892. A programmatic version of this one-standard-error rule is sketched below.
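A small sketch of that rule in code, reading the cp table from the unpruned model (the object name cp_optimal is ours):
# Apply the one-standard-error rule to the cp table of the unpruned tree
cp_table   <- as.data.frame(model_dt$cptable)
best       <- which.min(cp_table$xerror)                            # row with the lowest xerror (row 10)
threshold  <- cp_table$xerror[best] + cp_table$xstd[best]           # one standard error above the best xerror
cp_optimal <- cp_table$CP[which(cp_table$xerror <= threshold)[1]]   # simplest tree within that threshold
cp_optimal                                                          # ~0.002067 (row 5, 8 splits)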
Now, we create the pruned model by using the function prune() on model_dt with cp set to 0.002066892.
set.seed(100)
# Create a pruned model based on the optimal cp
model_dt_prune <- prune(tree=model_dt, cp=0.002066892)
# Plot the model prune
rpart.plot(model_dt_prune, type=1, sub = NULL)
Insight: With pruning, the decision tree model is much simpler and easier to understand. The poutcome variable becomes the root node, and there are 8 interior nodes and 10 terminal nodes. Thus, we will use model_dt_prune as our decision tree model.
Prediction
The prediction can be generated by using function predict() with bank_test as our newdata.
pred_dt <-predict(object=model_dt_prune, newdata=bank_test, type="class")
Model Evaluation
ROC and AUC
# Create a probability prediction from `model_dt`
prob_dt <- predict(object = model_dt_prune, newdata=bank_test, type="prob")
# Create a prob and label from prob_dt.
roc_dt <- data.frame(prob=prob_dt[,2],
label=as.numeric(bank_test$y=="yes"))
# Create an object prediction
prediction_roc_dt <- prediction(predictions = roc_dt$prob,
labels = roc_dt$label)
# Create an ROC plot
plot(ROCR::performance(prediction.obj = prediction_roc_dt,
measure = "tpr",
x.measure = "fpr"),main = "ROC Decision Tree", col="#519259", print.auc=T)
abline(a = 0, b = 1)
# Obtain the AUC
auc_dt <- ROCR::performance(prediction.obj=prediction_roc_dt, measure = "auc")
auc_dt@y.values[[1]]
## [1] 0.7332124
Insight: The AUC of the model is 0.7332 (AUC is between 0.5 and 1), implying that the model can distinguish between positive and negative classes.
Confusion Matrix
con_dt_test <- confusionMatrix(data=pred_dt, reference=bank_test$y, positive="yes")
con_dt_test
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7415 595
## yes 575 458
##
## Accuracy : 0.8706
## 95% CI : (0.8635, 0.8775)
## No Information Rate : 0.8836
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0.366
##
## Mcnemar's Test P-Value : 0.5786
##
## Sensitivity : 0.43495
## Specificity : 0.92804
## Pos Pred Value : 0.44337
## Neg Pred Value : 0.92572
## Prevalence : 0.11644
## Detection Rate : 0.05065
## Detection Prevalence : 0.11423
## Balanced Accuracy : 0.68149
##
## 'Positive' Class : yes
##
Insights:
- The accuracy value is 0.8706, implying that the model predicts 87.06% of the test data correctly.
- The precision value is 0.4433, implying that 44.33% of the cases predicted as positive actually turned out to be positive. The model is still fairly poor at classifying the positive predicted cases.
- The sensitivity and specificity values are 43.49% and 92.80%, respectively. This implies that most actual negative cases are correctly classified, unlike the positive cases.
Over or Underfitting
We will observe our model for under and overfitting conditions. This can be done by comparing the performance metrics of train and test data.
# Model prediction based on data train
pred_dt_train <- predict(object=model_dt_prune, newdata=bank_train_up, type="class")
# Confusion matrix data train
con_dt_train <- confusionMatrix(pred_dt_train,
bank_train_up$y,
positive="yes")
Insight: In terms of accuracy, sensitivity and specificity, the model is relatively optimum. However, judging by the precision value, the model would be categorized as overfitted, because the precision on the test data is significantly lower than on the train data.
Performance Trade-Off
We will try to tune the model by adjusting the probability threshold of the model prediction.
Now, we can observe detailed metrics from this model and compare the results with the model using the default threshold. The dt_table object used in the next chunk is built as sketched below.
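The trade-off plot and the construction of dt_table are not echoed above; presumably they mirror the Naive Bayes section, along these lines:
# Build dt_table analogously to naive_table, using the pruned decision tree predictions
dt_table <- dplyr::select(bank_test, y) %>%
  bind_cols(bank_pred  = pred_dt) %>%
  bind_cols(bank_eprob = round(prob_dt[, 1], 4)) %>%
  bind_cols(bank_pprob = round(prob_dt[, 2], 4))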
# Classify a probability value into target classes.
dt_table <- dt_table %>%
mutate(tuning_pred = as.factor(ifelse(bank_pprob >= 0.43, "yes", "no")))
con_dt_adj <- confusionMatrix(dt_table$tuning_pred, bank_test$y, positive="yes")
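A possible way to compare the adjusted-threshold metrics with the default-threshold ones side by side:
# Compare the metrics at the default 0.50 threshold and the adjusted 0.43 threshold
metrics <- c("Sensitivity", "Specificity", "Pos Pred Value")
data.frame(
  default_0.50  = c(con_dt_test$overall["Accuracy"], con_dt_test$byClass[metrics]),
  adjusted_0.43 = c(con_dt_adj$overall["Accuracy"],  con_dt_adj$byClass[metrics])
)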
Insight:
- As illustrated in the plot, the intersection between specificity and sensitivity at a threshold of 0.43 gives performance metrics similar to those of the model with the default 0.50 threshold. Thus, we can use the performance metrics at the 0.43 threshold.
- The plot also suggests a trade-off between sensitivity and specificity: with a threshold below 0.43 we obtain higher sensitivity and lower specificity, and vice versa with a threshold between 0.43 and 0.80.
Random Forest
A Random Forest classifier consists of many decision trees that operate as an ensemble. Each decision tree produces a class prediction, and the class with the majority of the votes becomes the model’s prediction. The concepts behind the random forest are bagging (bootstrap aggregation) and feature randomness when constructing each decision tree, which together create an uncorrelated forest of trees.
Modelling
We create a random forest model by using the function train() with method="rf". K-fold cross-validation is used in this model: it splits the training dataset into K folds, and each fold is used as a validation set at some point. To use K-fold cross-validation, we specify the function trainControl() with method="repeatedcv". Setting number = 5 means there are five folds, and repeats = 3 repeats the whole procedure three times.
Since the running time of the random forest model is quite long, it is wise to save the model into an RDS file.
#set.seed(100)
# Use k-fold cross-validation
#ctrl <- trainControl(method="repeatedcv", number=5, repeats=3)
# Create a random forest model
#forest_up <- train(y ~., data=bank_train_up, method= "rf", trControl = ctrl)
# Save the model into RDS file
#saveRDS(forest_up, "forest_up.RDS")
After running the model and saving it to forest_up.RDS, we can read the file by using the function readRDS() and observe the model.
# Read forest_up.RDS and assign it to a new object called `model_rf`
model_rf <- readRDS("forest_up.RDS")
Interpretation
We can obtain the out-of-bag error (the error on unseen bootstrap data) and the per-class errors by calling model_rf$finalModel. Then, we can plot the model to observe the optimum mtry.
# Observe the model
model_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)), trainControl = ..1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 20
##
## OOB estimate of error rate: 2.25%
## Confusion matrix:
## no yes class.error
## no 30496 1436 0.0449705624
## yes 2 31930 0.0000626331
# Plot the model
plot(model_rf)
Insight:
- The output shows that the out-of-bag error of model_rf is 2.25%, implying that the model has a high out-of-bag accuracy of 97.75%.
- Based on the plot, the model with mtry = 20 has the highest accuracy value when tested on data from bootstrap sampling.
We can also obtain the most important variables in the random forest model by using function varImp().
varImp(model_rf)
## rf variable importance
##
## only 20 most important variables shown (out of 39)
##
## Overall
## balance 100.000
## age 72.992
## day 63.851
## poutcomesuccess 31.910
## campaign 31.381
## contactunknown 22.392
## previous 14.094
## housingyes 13.791
## loanyes 8.052
## educationsecondary 7.915
## jobtechnician 7.354
## jobblue-collar 7.189
## monthaug 6.993
## maritalmarried 6.889
## monthjul 6.868
## educationtertiary 6.791
## jobmanagement 6.612
## monthnov 5.368
## maritalsingle 5.223
## poutcomeunknown 5.143
Insight : The table above indicates that the most important predictor variable for y is balance, followed by age and day.
Prediction
The prediction can be generated by using function predict() with bank_test as our newdata.
pred_rf <-predict(object=model_rf, newdata=bank_test)
Model Evaluation
ROC and AUC
# Create a probability prediction from `model_rf`
prob_rf <- predict(object = model_rf, newdata=bank_test, type="prob")
# Create a prob and label from prob_rf.
roc_rf <- data.frame(prob=prob_rf[,2],
label=as.numeric(bank_test$y=="yes"))
# Create an object prediction
prediction_roc_rf <- prediction(predictions = roc_rf$prob,
labels = roc_rf$label)
# Create an ROC plot
plot(ROCR::performance(prediction.obj = prediction_roc_rf,
measure = "tpr",
x.measure = "fpr"),main = "ROC Random Forest", col="#519259", print.auc=T)
abline(a = 0, b = 1)
# Obtain the AUC
auc_rf <- ROCR::performance(prediction.obj=prediction_roc_rf, measure = "auc")
auc_rf@y.values[[1]]
## [1] 0.7766193
Insight: The AUC of the model is 0.776 (AUC is between 0.5 and 1), implying that the model can distinguish between positive and negative classes.
Confusion Matrix
con_rf_test <- confusionMatrix(as.factor(pred_rf), bank_test$y, positive="yes")
con_rf_test
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7640 701
## yes 350 352
##
## Accuracy : 0.8838
## 95% CI : (0.877, 0.8903)
## No Information Rate : 0.8836
## P-Value [Acc > NIR] : 0.4821
##
## Kappa : 0.3396
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.33428
## Specificity : 0.95620
## Pos Pred Value : 0.50142
## Neg Pred Value : 0.91596
## Prevalence : 0.11644
## Detection Rate : 0.03893
## Detection Prevalence : 0.07763
## Balanced Accuracy : 0.64524
##
## 'Positive' Class : yes
##
Insights:
- The confusion matrix shows that the Random Forest model has an accuracy value of 0.8838, implying that the model predicts 88.38% of the test data correctly.
- The precision value is 0.5014, implying that 50.14% of the cases predicted as positive actually turned out to be positive.
- The sensitivity and specificity values are 33.43% and 95.62%, respectively. This implies that most actual negative cases are correctly classified, as opposed to the positive cases.
Over or Underfitting
We will observe our model for under and overfitting conditions. This can be done by comparing the performance metrics of train and test data.
# Model prediction based on the train data
pred_rf_train <- predict(object=model_rf, newdata=bank_train_up)
# Confusion matrix for the train data
con_rf_train <- confusionMatrix(pred_rf_train,
                                bank_train_up$y,
                                positive="yes")
Insight: The model has high accuracy and specificity values on the test data. However, the sensitivity and precision values on the train data are higher than on the test data, implying that the model is overfitted in terms of those metrics.
Performance Trade-Off
We will try to tune the model by adjusting the probability threshold of the model prediction.
Now, we can observe detailed metrics from this model and compare the results with the model using the default threshold. The rf_table object used in the next chunk is built as sketched below.
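The construction of rf_table is likewise not echoed; presumably it is built the same way from the random forest predictions:
# Build rf_table from the random forest class and probability predictions
rf_table <- dplyr::select(bank_test, y) %>%
  bind_cols(bank_pred  = pred_rf) %>%
  bind_cols(bank_eprob = round(prob_rf[, 1], 4)) %>%
  bind_cols(bank_pprob = round(prob_rf[, 2], 4))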
# Classify a probability value into target classes.
rf_table <- rf_table %>%
mutate(tuning_pred = as.factor(ifelse(bank_pprob >= 0.4, "yes", "no")))
# Create confusion matrix
con_rf_adj <- confusionMatrix(rf_table$tuning_pred, bank_test$y, positive="yes")
Insight: As illustrated in the plot, moving the threshold to 0.40 (near the intersection between precision and sensitivity) leads to lower accuracy, specificity and precision values; the trade-off is that sensitivity increases. We will keep the model with the default threshold, as it produces a higher accuracy.
Conclusion
The table below shows the comparison between the three models.
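The chunk that produces the comparison table is not echoed; one way such a summary might be assembled from the objects created above (the name summary_models is ours) is:
# Summarize accuracy, sensitivity, specificity, precision and AUC for the three models
summary_models <- data.frame(
  Model       = c("Naive Bayes", "Decision Tree", "Random Forest"),
  Accuracy    = c(con_naive_test$overall["Accuracy"],
                  con_dt_test$overall["Accuracy"],
                  con_rf_test$overall["Accuracy"]),
  Sensitivity = c(con_naive_test$byClass["Sensitivity"],
                  con_dt_test$byClass["Sensitivity"],
                  con_rf_test$byClass["Sensitivity"]),
  Specificity = c(con_naive_test$byClass["Specificity"],
                  con_dt_test$byClass["Specificity"],
                  con_rf_test$byClass["Specificity"]),
  Precision   = c(con_naive_test$byClass["Pos Pred Value"],
                  con_dt_test$byClass["Pos Pred Value"],
                  con_rf_test$byClass["Pos Pred Value"]),
  AUC         = c(auc_naive@y.values[[1]], auc_dt@y.values[[1]], auc_rf@y.values[[1]])
)
summary_models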
Insight:
- The summary table shows that the Random Forest model (with K-fold cross-validation) and the Decision Tree model produce high accuracy values. This implies that these models are appropriate for correctly predicting a customer’s preference towards the term deposit (product).
- All models have relatively low precision values, implying that they are poor at minimizing false positives. Since the Random Forest model has the highest precision, it might be used to reduce labour costs by avoiding calls to uninterested customers; however, some improvement is still required to increase this value.
- If the objective of the bank is to increase the number of customers and not lose any opportunities, the Naive Bayes model is the appropriate choice, as confirmed by its highest sensitivity value.
- In conclusion, the Random Forest model with K-fold cross-validation is the best model for our project objective, as confirmed by its highest accuracy and AUC values.