We will learn to use classification models on a bank marketing dataset. We want to understand the relationships among the variables, especially between y (has the client subscribed to a term deposit?) and the other variables. We also want to predict future values of y based on the historical data.
In the following code we use a dataset prepared by and hosted on the UCI Machine Learning Repository.
The data is related to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). Full attribute information for the 17 variables is documented on the repository page.
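If bank.csv is not already on disk, the dataset can be fetched directly from the repository. A minimal sketch, assuming the UCI download URL for the bank marketing archive is still current:
# optional: download and extract bank.csv from UCI
# (URL is an assumption; adjust if the repository layout has changed)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip"
download.file(url, destfile = "bank.zip", mode = "wb")
unzip("bank.zip", files = "bank.csv", exdir = "data_input2")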
Load the required packages
library(tidyverse)
library(GGally)
library(car)
library(caret)
library(class)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(inspectdf)
library(tidymodels)
options(scipen = 100, max.print = 1e+06)

# Read the dataset in
bank <- read.csv("data_input2/bank.csv", sep = ";")
glimpse(bank)
## Rows: 4,521
## Columns: 17
## $ age <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
## $ job <chr> "unemployed", "services", "management", "management", "blue-…
## $ marital <chr> "married", "married", "single", "married", "married", "singl…
## $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondary",…
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
## $ balance <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
## $ housing <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes",…
## $ loan <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "yes…
## $ contact <chr> "cellular", "cellular", "cellular", "unknown", "unknown", "c…
## $ day <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29,…
## $ month <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "may…
## $ duration <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 32…
## $ campaign <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, …
## $ pdays <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1,…
## $ previous <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, …
## $ poutcome <chr> "unknown", "failure", "failure", "unknown", "unknown", "fail…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
The data has 4,521 rows and 17 columns.
First, we check for missing values:
anyNA(bank)
## [1] FALSE
Our data has no missing values.
Exploratory data analysis is the phase where we explore the variables and look for patterns that might indicate correlations between them.
x <- inspectdf::inspect_cat(bank)
show_plot(x)
Based on the plot, the levels of our target variable are unbalanced. To avoid discarding observations (and variance) from the majority class, I will use upsampling rather than downsampling to balance the proportions.
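To quantify the imbalance before resampling, we can check the class proportions of the raw target; a quick one-line sketch:
# proportion of each target class before balancing
prop.table(table(bank$y))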
GGally::ggcorr(bank, hjust = 1, layout.exp = 2, label = T, label_size = 2.9)
Based on the plot above, no pair of numeric predictors is highly correlated with any other.
set.seed(417)
split <- initial_split(data = bank, prop = 0.8, strata = "y")
train <- training(split)
test <- testing(split)

bank_recipe <- recipe(y ~ ., train) %>%
themis::step_upsample(y, seed = 417) %>%
step_nzv(all_predictors()) %>%
prep()
bank_train <- juice(bank_recipe)
bank_test <- bake(bank_recipe, test)
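As a sanity check that the upsampling worked, we can count the classes in the balanced training set; after step_upsample both levels should be equal:
# class counts after upsampling; both levels should now match
table(bank_train$y)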
There are certain characteristics of Naive Bayes that should be considered: it assumes the predictors are conditionally independent given the class, and without smoothing it assigns zero probability to any predictor level never observed together with a class. We are still going to try Naive Bayes, and the result will be compared with the other models. To handle the zero-probability problem we apply a Laplace estimator while building the model, so an unseen level receives a small positive probability instead of zero.
library(e1071)
# model building
naive <- naiveBayes(y ~ ., bank_train, laplace = 1)

# model fitting
naive_pred <- predict(naive, bank_test, type = "class") # for the class prediction
naive_prob <- predict(naive, bank_test, type = "raw") # for the probability
# result
naive_table <- select(bank_test, y) %>%
bind_cols(y_pred = naive_pred) %>%
bind_cols(y_eprob = round(naive_prob[,1],4)) %>%
bind_cols(y_pprob = round(naive_prob[,2],4))
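To see what the Laplace estimator did, we can inspect the conditional probability tables stored in the fitted model. A sketch for one categorical predictor (contact is used here on the assumption that it survived the near-zero-variance filter):
# conditional distribution of contact given y; with laplace = 1 every
# count gets +1, so no conditional probability is exactly zero
naive$tables$contact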
# performance evaluation - confusion matrix
naive_table %>%
conf_mat(y, y_pred) %>%
autoplot(type = "heatmap")

naive_table %>%
summarise(
accuracy = accuracy_vec(y, y_pred),
sensitivity = sens_vec(y, y_pred),
specificity = spec_vec(y, y_pred),
precision = precision_vec(y, y_pred)
)
## # A tibble: 1 × 4
## accuracy sensitivity specificity precision
## <dbl> <dbl> <dbl> <dbl>
## 1 0.822 0.83 0.762 0.964
# ROC
naive_roc <- data.frame(prediction=round(naive_prob[,1],4),
trueclass=as.numeric(naive_table$y=="no"))
head(naive_roc)
## prediction trueclass
## 1 0.2643 1
## 2 0.9636 1
## 3 0.9680 1
## 4 0.6033 1
## 5 0.9261 1
## 6 0.9803 1
library(ROCR)
naive_roc <- ROCR::prediction(naive_roc$prediction, naive_roc$trueclass)
# ROC curve
plot(performance(naive_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)

# AUC
auc_ROCR_n <- performance(naive_roc, measure = "auc")
auc_ROCR_n <- auc_ROCR_n@y.values[[1]]
auc_ROCR_n
## [1] 0.8688929
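As a cross-check, yardstick (loaded with tidymodels) can compute the same AUC directly from the class probabilities. A sketch, assuming y is a factor whose first level is "no", since y_eprob is the probability of "no":
# AUC via yardstick, using the probability of the first factor level
roc_auc_vec(factor(naive_table$y), naive_table$y_eprob)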
Below, I demonstrate some model tuning by changing the classification threshold (cutoff).
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# model tuning - metrics function
metrics <- function(cutoff, prob, ref, postarget, negtarget)
{
  # classify by thresholding the probability, then collect the headline metrics
  pred <- factor(ifelse(prob >= cutoff, postarget, negtarget))
  conf <- caret::confusionMatrix(pred, ref, positive = postarget)
  acc  <- conf$overall[1]  # accuracy
  rec  <- conf$byClass[1]  # sensitivity / recall
  prec <- conf$byClass[3]  # positive predictive value / precision
  spec <- conf$byClass[2]  # specificity
  mat  <- t(as.matrix(c(rec, acc, prec, spec)))
  colnames(mat) <- c("recall", "accuracy", "precision", "specificity")
  return(mat)
}
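For a single threshold the function can be called directly; a quick sketch at a 0.5 cutoff, using the same coding of the reference as the loop below:
# example: metrics at a single 0.5 cutoff on the Naive Bayes probabilities
metrics(cutoff = 0.5,
        prob = naive_table$y_eprob,
        ref = as.factor(ifelse(naive_table$y == "no", 1, 0)),
        postarget = "1",
        negtarget = "0")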
co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)
# apply function metrics
for(i in 1:100){
result[i,] = metrics(cutoff = co[i],
prob = naive_table$y_eprob,
ref = as.factor(ifelse(naive_table$y == "no", 1, 0)),
postarget = "1",
negtarget = "0")
}
# visualize
ggplotly(tibble("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "Metrics", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff Model Performance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()))

# tuning final model - cutoff chosen from the tradeoff plot above
naive_table <- naive_table %>%
  mutate(tuning_pred = as.factor(ifelse(y_eprob >= 0.445, "no", "yes")))
# metrics result
final_n <- naive_table %>%
summarise(
accuracy = accuracy_vec(y, tuning_pred),
sensitivity = sens_vec(y, tuning_pred),
specificity = spec_vec(y, tuning_pred),
precision = precision_vec(y, tuning_pred)
) %>%
cbind(AUC = auc_ROCR_n)
final_n
## accuracy sensitivity specificity precision AUC
## 1 0.8430939 0.85625 0.7428571 0.9620787 0.8688929
The decision tree is a tree-based model whose major benefit is interpretability: the algorithm learns a set of rules that can be visualized in a diagram resembling a tree.
To see what this looks like, let's build a decision tree model on our bank_train data. We can use the rpart package to build the model and the rattle and rpart.plot packages to visualize it.
library(rpart)
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
library(rattle)
## Warning: package 'rattle' was built under R version 4.2.3
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.2.3
# model building
dtree <- rpart(formula = y ~ ., data = bank_train, method = "class")
fancyRpartPlot(dtree, sub = NULL)
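Because every path from the root to a leaf is a rule, the tree can also be printed as text. A sketch using rpart.plot's rule printer:
# print the tree's decision rules, one row per leaf
rpart.plot::rpart.rules(dtree)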
# model fitting
dtree_pred <- predict(dtree, bank_test, type = "class")
dtree_prob <- predict(dtree, bank_test, type = "prob")
# result
dtree_table <- select(bank_test, y) %>%
bind_cols(y_pred = dtree_pred) %>%
bind_cols(y_eprob = round(dtree_prob[,1],4)) %>%
bind_cols(y_pprob = round(dtree_prob[,2],4))
# performance evaluation - confusion matrix
dtree_table %>%
conf_mat(y, y_pred) %>%
autoplot(type = "heatmap")

dtree_table %>%
summarise(
accuracy = accuracy_vec(y, y_pred),
sensitivity = sens_vec(y, y_pred),
specificity = spec_vec(y, y_pred),
precision = precision_vec(y, y_pred)
)
## # A tibble: 1 × 4
## accuracy sensitivity specificity precision
## <dbl> <dbl> <dbl> <dbl>
## 1 0.797 0.789 0.857 0.977
# ROC
dtree_roc <- data.frame(prediction=round(dtree_prob[,1],4),
trueclass=as.numeric(dtree_table$y=="no"))
head(dtree_roc)
## prediction trueclass
## 1 0.1859 1
## 2 0.9374 1
## 3 0.9443 1
## 4 0.3426 1
## 5 0.9374 1
## 6 0.9374 1
dtree_roc <- ROCR::prediction(dtree_roc$prediction, dtree_roc$trueclass)
# ROC curve
plot(performance(dtree_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)

# AUC
auc_ROCR_d <- performance(dtree_roc, measure = "auc")
auc_ROCR_d <- auc_ROCR_d@y.values[[1]]
auc_ROCR_d
## [1] 0.8462798
for(i in 1:100){
result[i,] = metrics(cutoff = co[i],
prob = dtree_table$y_eprob,
ref = as.factor(ifelse(dtree_table$y == "no", 1, 0)),
postarget = "1",
negtarget = "0")
}
## Warning in confusionMatrix.default(predict, ref, positive = postarget): Levels
## are not in the same order for reference and data. Refactoring data to match.
ggplotly(tibble("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "Metrics", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff Model Performance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()))
# tuning final model - cutoff chosen from the tradeoff plot above
dtree_table <- dtree_table %>%
  mutate(tuning_pred = as.factor(ifelse(y_eprob >= 0.7, "no", "yes")))
# metrics result
final_d <- dtree_table %>%
summarise(
accuracy = accuracy_vec(y, tuning_pred),
sensitivity = sens_vec(y, tuning_pred),
specificity = spec_vec(y, tuning_pred),
precision = precision_vec(y, tuning_pred)
) %>%
cbind(AUC = auc_ROCR_d)
final_d
## accuracy sensitivity specificity precision AUC
## 1 0.7966851 0.78875 0.8571429 0.9767802 0.8462798
A random forest is an ensemble of decision trees, each grown on a bootstrap sample of the training data with a random subset of predictors (mtry) considered at each split; the forest predicts by majority vote. To really understand how random forests work, let's apply the algorithm to our bank data.
# model building
set.seed(417)
ctrl <- trainControl(method="repeatedcv", number=4, repeats=4) # k-fold cross validation
forest <- train(y ~ ., data=bank_train, method="rf", trControl = ctrl)
forest
## Random Forest
##
## 6400 samples
## 14 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 4 times)
## Summary of sample sizes: 4800, 4800, 4800, 4800, 4800, 4800, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8912500 0.7825000
## 21 0.9636328 0.9272656
## 40 0.9583594 0.9167188
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 21.
From the model summary, we know that the optimum number of variables considered for splitting at each tree node (mtry) is 21. We can also inspect the importance of each variable that was used in our random forest using varImp().
varImp(forest)
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## duration 100.000
## balance 19.652
## age 19.582
## day 17.927
## poutcomesuccess 13.776
## contactunknown 10.752
## campaign 7.365
## monthoct 6.415
## previous 5.604
## monthmar 4.847
## monthjul 4.728
## housingyes 2.881
## jobstudent 2.814
## monthjun 2.629
## monthmay 2.594
## poutcomeunknown 2.583
## jobblue-collar 2.391
## monthnov 2.353
## monthfeb 2.270
## educationtertiary 2.073
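The importance scores are easier to scan as a plot; caret's varImp object has a plot method. A sketch showing the ten most important variables:
# lattice dot plot of the top 10 variables by importance
plot(varImp(forest), top = 10)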
plot(forest$finalModel)
legend("topright", colnames(forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)forest$finalModel##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 21
##
## OOB estimate of error rate: 2.77%
## Confusion matrix:
## no yes class.error
## no 3023 177 0.0553125
## yes 0 3200 0.0000000
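The OOB estimate above is computed on the training rows each tree did not see. A sketch extracting the final value directly; err.rate holds one row per tree with an OOB column:
# OOB error after the last tree, matching the summary above
tail(forest$finalModel$err.rate[, "OOB"], 1)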
Let’s test our random forest model to our bank_test dataset:
# model fitting
forest_pred <- predict(forest, bank_test, type = "raw")
forest_prob <- predict(forest, bank_test, type = "prob")
# result
forest_table <- select(bank_test, y) %>%
bind_cols(y_pred = forest_pred) %>%
bind_cols(y_eprob = round(forest_prob[,1],4)) %>%
bind_cols(y_pprob = round(forest_prob[,2],4))
# performance evaluation - confusion matrix
forest_table %>%
conf_mat(y, y_pred) %>%
autoplot(type = "heatmap")

forest_table %>%
summarise(
accuracy = accuracy_vec(y, y_pred),
sensitivity = sens_vec(y, y_pred),
specificity = spec_vec(y, y_pred),
precision = precision_vec(y, y_pred)
)
## # A tibble: 1 × 4
## accuracy sensitivity specificity precision
## <dbl> <dbl> <dbl> <dbl>
## 1 0.888 0.946 0.448 0.929
# ROC
forest_roc <- data.frame(prediction=round(forest_prob[,1],4),
trueclass=as.numeric(forest_table$y=="no"))
head(forest_roc)
## prediction trueclass
## 1 0.700 1
## 2 0.944 1
## 3 0.998 1
## 4 0.752 1
## 5 0.994 1
## 6 1.000 1
forest_roc <- ROCR::prediction(forest_roc$prediction, forest_roc$trueclass)
# ROC curve
plot(performance(forest_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)

# AUC
auc_ROCR_f <- performance(forest_roc, measure = "auc")
auc_ROCR_f <- auc_ROCR_f@y.values[[1]]
auc_ROCR_f
## [1] 0.8945595
for(i in 1:100){
result[i,] = metrics(cutoff = co[i],
prob = forest_table$y_eprob,
ref = as.factor(ifelse(forest_table$y == "no", 1, 0)),
postarget = "1",
negtarget = "0")
}
## Warning in confusionMatrix.default(predict, ref, positive = postarget): Levels
## are not in the same order for reference and data. Refactoring data to match.
ggplotly(tibble("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "Metrics", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff Model Performance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()))

# tuning final model
forest_table <- forest_table %>%
mutate(tuning_pred = as.factor(ifelse(y_eprob >= 0.5, "no", "yes")))
# metrics result
final_f <- forest_table %>%
summarise(
accuracy = accuracy_vec(y, tuning_pred),
sensitivity = sens_vec(y, tuning_pred),
specificity = spec_vec(y, tuning_pred),
precision = precision_vec(y, tuning_pred)
) %>%
cbind(AUC = auc_ROCR_f)
final_f
## accuracy sensitivity specificity precision AUC
## 1 0.8872928 0.94625 0.4380952 0.9276961 0.8945595
rbind("Naive Bayes" = final_n, "Decision Tree" = final_d, "Random Forest" = final_f)## accuracy sensitivity specificity precision AUC
## Naive Bayes 0.8430939 0.85625 0.7428571 0.9620787 0.8688929
## Decision Tree 0.7966851 0.78875 0.8571429 0.9767802 0.8462798
## Random Forest 0.8872928 0.94625 0.4380952 0.9276961 0.8945595
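To make the comparison programmatic, we can store the table and ask which model maximizes a given metric; a small sketch using AUC:
# store the comparison and report the model with the highest AUC
comparison <- rbind("Naive Bayes" = final_n,
                    "Decision Tree" = final_d,
                    "Random Forest" = final_f)
rownames(comparison)[which.max(comparison$AUC)]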
Based on the metrics table above, the predictive model built using the Random Forest algorithm gave the best overall result. The model gave the highest accuracy (88.7%) while maintaining sensitivity and precision above 92%, and it also gave the highest AUC at 89.5%, although its specificity is noticeably lower than that of the other two models. Therefore the best model for predicting whether a bank client will subscribe to a term deposit is the Random Forest model.