The problem we will solve is predicting whether a customer will subscribe
to a time deposit. We will use three machine learning methods (Naive Bayes,
decision tree, and random forest) with predictors that are either numeric or
categorical. The target of our prediction is column y, which contains yes
(the customer subscribes to a time deposit) or no (the customer does not).
We will compare the three methods and conclude which one is best suited to
this prediction.
library(dplyr)
library(caret)
library(e1071)
library(ROCR)
library(partykit)
library(rsample)
library(randomForest)
Import the data that we have prepared, namely bank.csv.
Use the read.csv() command, which matches the file extension.
bank <- read.csv("bank.csv")
Let’s take a quick look at the data content with the head() command.
head(bank)
We check the data types with the glimpse() command.
glimpse(bank)
#> Rows: 4,521
#> Columns: 17
#> $ age <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
#> $ job <chr> "unemployed", "services", "management", "management", "blue-…
#> $ marital <chr> "married", "married", "single", "married", "married", "singl…
#> $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondary",…
#> $ default <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
#> $ balance <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
#> $ housing <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes",…
#> $ loan <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "yes…
#> $ contact <chr> "cellular", "cellular", "cellular", "unknown", "unknown", "c…
#> $ day <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29,…
#> $ month <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "may…
#> $ duration <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 32…
#> $ campaign <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, …
#> $ pdays <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1,…
#> $ previous <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, …
#> $ poutcome <chr> "unknown", "failure", "failure", "unknown", "unknown", "fail…
#> $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
From the glimpse() output above, we can see that the data has 4,521 rows and 17 columns. Here is the explanation of the variables:
age: (numeric)
job: type of job
marital: marital status
education: (categorical: "unknown", "secondary", "primary", "tertiary")
default: has credit in default? (categorical: "no", "yes", "unknown")
balance: average yearly balance, in euros (numeric)
housing: has housing loan? (binary: "yes", "no")
loan: has personal loan? (binary: "yes", "no")
contact: contact communication type (categorical: "cellular", "telephone")
day: last contact day of the month (numeric)
month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")
duration: last contact duration, in seconds (numeric)
campaign: number of contacts performed during this campaign and for this client
pdays: number of days that passed by after the client was last contacted from a previous campaign (-1 means the client was not previously contacted)
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign
y: has the client subscribed to a term deposit? (binary: "yes", "no")
The categorical columns were read as character, so we convert them to factors.
bank_clean <- bank %>%
mutate_at(vars(job, marital, education, default, housing, loan, contact, month, poutcome, y), as.factor)
glimpse(bank_clean)
#> Rows: 4,521
#> Columns: 17
#> $ age <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
#> $ job <fct> unemployed, services, management, management, blue-collar, m…
#> $ marital <fct> married, married, single, married, married, single, married,…
#> $ education <fct> primary, secondary, tertiary, tertiary, secondary, tertiary,…
#> $ default <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
#> $ balance <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
#> $ housing <fct> no, yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, no…
#> $ loan <fct> no, yes, no, yes, no, no, no, no, no, yes, no, no, no, no, y…
#> $ contact <fct> cellular, cellular, cellular, unknown, unknown, cellular, ce…
#> $ day <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29,…
#> $ month <fct> oct, may, apr, jun, may, feb, may, may, may, apr, may, apr, …
#> $ duration <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 32…
#> $ campaign <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, …
#> $ pdays <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1,…
#> $ previous <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, …
#> $ poutcome <fct> unknown, failure, failure, unknown, unknown, failure, other,…
#> $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, yes, no,…
Check for missing values:
colSums(is.na(bank_clean))
#> age job marital education default balance housing loan
#> 0 0 0 0 0 0 0 0
#> contact day month duration campaign pdays previous poutcome
#> 0 0 0 0 0 0 0 0
#> y
#> 0
Check the distribution/pattern of the data:
summary(bank_clean)
#> age job marital education default
#> Min. :19.00 management :969 divorced: 528 primary : 678 no :4445
#> 1st Qu.:33.00 blue-collar:946 married :2797 secondary:2306 yes: 76
#> Median :39.00 technician :768 single :1196 tertiary :1350
#> Mean :41.17 admin. :478 unknown : 187
#> 3rd Qu.:49.00 services :417
#> Max. :87.00 retired :230
#> (Other) :713
#> balance housing loan contact day
#> Min. :-3313 no :1962 no :3830 cellular :2896 Min. : 1.00
#> 1st Qu.: 69 yes:2559 yes: 691 telephone: 301 1st Qu.: 9.00
#> Median : 444 unknown :1324 Median :16.00
#> Mean : 1423 Mean :15.92
#> 3rd Qu.: 1480 3rd Qu.:21.00
#> Max. :71188 Max. :31.00
#>
#> month duration campaign pdays
#> may :1398 Min. : 4 Min. : 1.000 Min. : -1.00
#> jul : 706 1st Qu.: 104 1st Qu.: 1.000 1st Qu.: -1.00
#> aug : 633 Median : 185 Median : 2.000 Median : -1.00
#> jun : 531 Mean : 264 Mean : 2.794 Mean : 39.77
#> nov : 389 3rd Qu.: 329 3rd Qu.: 3.000 3rd Qu.: -1.00
#> apr : 293 Max. :3025 Max. :50.000 Max. :871.00
#> (Other): 571
#> previous poutcome y
#> Min. : 0.0000 failure: 490 no :4000
#> 1st Qu.: 0.0000 other : 197 yes: 521
#> Median : 0.0000 success: 129
#> Mean : 0.5426 unknown:3705
#> 3rd Qu.: 0.0000
#> Max. :25.0000
#>
Insight:
age: minimum 19, maximum 87.
job: management is the most frequent level (969); retired is the smallest of the six levels listed individually (230), and the remaining levels are grouped under (Other).
marital: married is the most frequent (2,797) and divorced the least (528).
education: secondary is the most frequent (2,306) and unknown the least (187).
default: no is the most frequent (4,445) and yes the least (76).
balance: minimum -3,313, maximum 71,188.
housing: yes is the most frequent (2,559) and no the least (1,962).
loan: no is the most frequent (3,830) and yes the least (691).
contact: cellular is the most frequent (2,896) and telephone the least (301).
day: minimum 1, maximum 31.
month: may is the most frequent (1,398).
duration: minimum 4, maximum 3,025.
campaign: minimum 1, maximum 50.
pdays: minimum -1, maximum 871.
previous: minimum 0, maximum 25.
poutcome: unknown is the most frequent (3,705) and success the least (129).
y: no is the most frequent (4,000) and yes the least (521).
We split the data into a training set and a test set.
RNGkind(sample.kind = "Rounding")
set.seed(100)
index_bank <- sample(nrow(bank_clean), nrow(bank_clean)*0.80)
bank_train <- bank_clean[index_bank,] # for training
bank_test <- bank_clean[-index_bank,] # for prediction
Check the proportion of the target classes in bank_train:
prop.table(table(bank_train$y))
#>
#> no yes
#> 0.8821903 0.1178097
The target proportion is not balanced, so we upsample the minority class in the training data.
# upsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)
bank_train_up <- upSample(x = bank_train %>% select(-y),
y = bank_train$y,
yname = "y")
Check the target proportion:
prop.table(table(bank_train_up$y))
#>
#> no yes
#> 0.5 0.5
The target classes are now balanced.
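To make the upsampling step concrete, here is a minimal base-R sketch of the idea. It is not the internals of caret's upSample(); the helper name upsample_minority and the toy data frame are assumptions made for illustration. Rows of the smaller class are drawn with replacement until every class has as many rows as the largest one.

```r
# Hypothetical helper: resample minority-class rows with replacement
# until every class has as many rows as the largest class.
upsample_minority <- function(df, target) {
  counts <- table(df[[target]])
  n_max <- max(counts)
  parts <- lapply(names(counts), function(lvl) {
    rows <- which(df[[target]] == lvl)
    extra <- n_max - length(rows)
    # keep all original rows, then add 'extra' resampled ones
    added <- if (extra > 0) {
      rows[sample.int(length(rows), extra, replace = TRUE)]
    } else {
      integer(0)
    }
    df[c(rows, added), , drop = FALSE]
  })
  do.call(rbind, parts)
}

set.seed(100)
# Toy data (made up): 8 "no" rows and 2 "yes" rows
toy <- data.frame(x = 1:10,
                  y = factor(c(rep("no", 8), rep("yes", 2))))
toy_up <- upsample_minority(toy, "y")
table(toy_up$y)
```

Only the training data is resampled this way; the test data keeps its original class proportions so that the evaluation reflects real conditions.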
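Naive Bayes is a classification method based on Bayes’ theorem, which describes how the probability of an event is updated when new information becomes available. The method is called naive because it assumes the predictors are independent of one another given the class.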
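As a small worked example of Bayes’ theorem with one categorical predictor, the sketch below computes the probability of each class given housing = "no". The counts are made up for illustration, and the smoothing shown is the standard additive (Laplace) form; treat it as an assumption about how a laplace = 1 setting behaves rather than a description of the e1071 internals.

```r
# Toy counts (made up for illustration): rows = class y, columns = housing
counts <- matrix(c(30, 50,    # class "no":  housing = no, yes
                   15,  5),   # class "yes": housing = no, yes
                 nrow = 2, byrow = TRUE,
                 dimnames = list(y = c("no", "yes"),
                                 housing = c("no", "yes")))

laplace <- 1
# Prior P(y): share of each class in the training data
prior <- rowSums(counts) / sum(counts)
# Likelihood P(housing | y) with additive (Laplace) smoothing
likelihood <- (counts + laplace) / (rowSums(counts) + laplace * ncol(counts))

# Posterior P(y | housing = "no") via Bayes' theorem
unnormalised <- prior * likelihood[, "no"]
posterior <- unnormalised / sum(unnormalised)
posterior
```

With several predictors, the naive independence assumption means the likelihoods of the individual predictors are simply multiplied together before normalising.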
# train
model_nb_bank <- naiveBayes(y~., bank_train_up, laplace = 1)
Predict the class of the test data with the predict() function:
# predict class
bank_test$pred_label <- predict(object = model_nb_bank,
newdata=bank_test,
type="class")
Model evaluation with a confusion matrix:
con_bank_naive <- confusionMatrix(data = bank_test$pred_label, reference=bank_test$y, positive = "yes")
con_bank_naive
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 625 25
#> yes 185 70
#>
#> Accuracy : 0.768
#> 95% CI : (0.7391, 0.7951)
#> No Information Rate : 0.895
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.2917
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.73684
#> Specificity : 0.77160
#> Pos Pred Value : 0.27451
#> Neg Pred Value : 0.96154
#> Prevalence : 0.10497
#> Detection Rate : 0.07735
#> Detection Prevalence : 0.28177
#> Balanced Accuracy : 0.75422
#>
#> 'Positive' Class : yes
#>
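The main statistics in the output above follow directly from the four cells of the confusion matrix. The sketch below recomputes them by hand from the counts shown; the variable names are chosen here for illustration.

```r
# Counts copied from the Naive Bayes confusion matrix above
# (rows = prediction, columns = reference; positive class = "yes")
tn <- 625; fn <- 25
fp <- 185; tp <- 70

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # recall: share of actual "yes" that were found
specificity <- tn / (tn + fp)   # share of actual "no" that were found
precision   <- tp / (tp + fp)   # reported as Pos Pred Value

metrics <- round(c(accuracy = accuracy, sensitivity = sensitivity,
                   specificity = specificity, precision = precision), 4)
metrics
```

The low precision next to a fairly high sensitivity reflects the many false positives (185) relative to true positives (70).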
ROC is a curve that describes the relationship between True Positive Rate and False Positive Rate at each threshold. A good model should ideally have a high True Positive Rate and a low False Positive Rate. AUC shows the area under the ROC curve. The closer to 1, the better the model’s performance in separating positive and negative classes.
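AUC can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way in base R, using made-up scores and a hypothetical helper auc_rank; it also shows that scoring with the probability of the negative class gives 1 - AUC.

```r
# AUC as the probability that a random positive outranks a random negative
# (Mann-Whitney form); tied scores receive average ranks.
auc_rank <- function(score, actual) {
  r <- rank(score)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores for illustration: P(yes) for six observations
p_yes  <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
actual <- c(1,   1,   1,    0,   0,   0)

auc_pos <- auc_rank(p_yes, actual)       # score = probability of the positive class
auc_neg <- auc_rank(1 - p_yes, actual)   # score = probability of the negative class
c(auc_pos = auc_pos, auc_neg = auc_neg)
```

An AUC well below 0.5 is therefore usually a sign that the scores of the wrong class were used, not that the model is worse than random guessing.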
We construct the ROC curve of the model model_nb_bank. First we make predictions in the form of probabilities.
# get the test-data predictions as probabilities
bank_test$pred <- predict(model_nb_bank, bank_test, type="raw")
Prepare the data frame for ROC (optional, but it makes things easier). We assume the positive class is yes.
# prepare the actual labels as 1 and 0
bank_test$actual <- ifelse(bank_test$y == "yes", yes = 1, no = 0)
Set up the prediction() object, calculate TPR and FPR with the performance() function, then create the ROC curve with plot().
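# prediction object
bank_roc_pred <- prediction(predictions = bank_test$pred[, "yes"], # predicted probability of yes
labels = bank_test$actual) # actual labels as 1 and 0
# ROC curve
plot(performance(prediction.obj = bank_roc_pred, "tpr", "fpr"))
abline(0,1, lty=2)
# AUC value
auc_pred <- performance(prediction.obj = bank_roc_pred, "auc")
auc_pred@y.values # @ accesses a slot of the auc_pred object
#> [[1]]
#> [1] 0.8103704
The score passed to prediction() must be the probability of the positive class. The raw predictions have one column per class in the order no, yes, so column 1 holds the probability of no; scoring with that column mirrors the ROC curve and gives 1 - AUC = 0.1896296. With the probability of yes as the score, AUC = 1 - 0.1896296 = 0.8103704, so we can conclude that our model separates the yes and no classes reasonably well.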
Decision trees are a type of supervised machine learning algorithm that are used for both classification and regression tasks. They work by creating a tree-like structure where each node represents a decision, each branch represents a possible outcome, and each leaf node represents a final prediction. To create a Decision Tree model, the ctree() function from the partykit library can be used.
bank_tree <- ctree(formula = y ~ ., data = bank_train_up,
control = ctree_control(mincriterion = 0.95,
minsplit = 100,
minbucket = 80))
plot(bank_tree, type='simple')
Decision trees tend to overfit, so we require that the difference between the train and test evaluation metrics be at most 15%.
# predict classes on the train data
pred_train <- predict(bank_tree, bank_train_up, type="response")
# confusion matrix for the train data
tree_con_train <- confusionMatrix(pred_train, bank_train_up$y, positive = "yes")
tree_con_train
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 2467 282
#> yes 723 2908
#>
#> Accuracy : 0.8425
#> 95% CI : (0.8333, 0.8513)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.685
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.9116
#> Specificity : 0.7734
#> Pos Pred Value : 0.8009
#> Neg Pred Value : 0.8974
#> Prevalence : 0.5000
#> Detection Rate : 0.4558
#> Detection Prevalence : 0.5691
#> Balanced Accuracy : 0.8425
#>
#> 'Positive' Class : yes
#>
# predict classes on the test data
pred_test <- predict(bank_tree, bank_test, type="response")
# confusion matrix for the test data
tree_con_test <- confusionMatrix(pred_test, bank_test$y, positive = "yes")
tree_con_test
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 615 12
#> yes 195 83
#>
#> Accuracy : 0.7713
#> 95% CI : (0.7425, 0.7983)
#> No Information Rate : 0.895
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.3421
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.87368
#> Specificity : 0.75926
#> Pos Pred Value : 0.29856
#> Neg Pred Value : 0.98086
#> Prevalence : 0.10497
#> Detection Rate : 0.09171
#> Detection Prevalence : 0.30718
#> Balanced Accuracy : 0.81647
#>
#> 'Positive' Class : yes
#>
Sensitivity is 91.2% on the train data and 87.4% on the test data, a difference well below 15%.
Random Forest is one of the most popular and powerful machine learning algorithms. It falls under the category of ensemble learning, which means it combines several simpler models (in this case, decision trees) to produce a more accurate and stable model. We create a Random Forest model using bank_train_up with 5-fold cross-validation, repeated 3 times.
set.seed(417)
ctrl <- trainControl(method = "repeatedcv",
number = 5, # k-fold
repeats = 3) # repeats
bank_forest <- train(y ~ .,
data = bank_train_up,
method = "rf", # random forest
trControl = ctrl)
We save the model as an RDS file so that it does not have to be retrained.
saveRDS(bank_forest, file = "bank_forest.RDS")
We load the saved model:
bank_forest_f <- readRDS("bank_forest.RDS")
bank_forest_f
#> Random Forest
#>
#> 6380 samples
#> 16 predictor
#> 2 classes: 'no', 'yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times)
#> Summary of sample sizes: 5104, 5104, 5104, 5104, 5104, 5104, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.8874608 0.7749216
#> 22 0.9680773 0.9361546
#> 42 0.9641066 0.9282132
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 22.
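The resampling scheme reported above (5 folds, repeated 3 times) can be sketched in base R as follows. The helper make_repeated_folds is hypothetical and uses plain random fold labels; caret's own fold construction may differ, for example by balancing the classes across folds.

```r
# Hypothetical helper: assign each row to one of k folds, repeated several times
make_repeated_folds <- function(n, k, repeats) {
  lapply(seq_len(repeats), function(i) {
    sample(rep(seq_len(k), length.out = n))  # shuffled fold labels
  })
}

set.seed(417)
n <- 6380   # rows in bank_train_up, taken from the model summary above
folds <- make_repeated_folds(n, k = 5, repeats = 3)

# Each of the 5 x 3 = 15 resamples trains on the rows outside one fold
train_sizes <- unlist(lapply(folds, function(f) n - tabulate(f, nbins = 5)))
train_sizes
```

Each resample holds out 1,276 rows and trains on the remaining 5,104, which matches the sample sizes listed in the model summary.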
In the bootstrap sampling stage, some rows are not drawn for a given tree; these are referred to as out-of-bag (OOB) data. The Random Forest model uses the OOB data to evaluate itself by calculating the error on it (similar to test data). This error is called the OOB error. In the case of classification, the OOB error is the percentage of OOB data that is misclassified.
bank_forest_f$finalModel
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 22
#>
#> OOB estimate of error rate: 2.62%
#> Confusion matrix:
#> no yes class.error
#> no 3023 167 0.0523511
#> yes 0 3190 0.0000000
The OOB Error value for the bank_forest_f model is 2.62%. In other words, the model accuracy on OOB data is 97.38%.
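To illustrate where OOB data comes from, the sketch below draws a single bootstrap sample of the same size as the training data and counts the rows that were never drawn. It is an illustration of the sampling idea only, not of how randomForest stores its samples. The last lines recompute the OOB error from the confusion matrix printed above.

```r
set.seed(100)
n <- 6380                                   # rows in bank_train_up
boot <- sample.int(n, n, replace = TRUE)    # one bootstrap sample, as drawn for a single tree
oob  <- setdiff(seq_len(n), boot)           # rows never drawn: out-of-bag for that tree

oob_share <- length(oob) / n                # observed share of OOB rows
theory    <- (1 - 1 / n)^n                  # approaches exp(-1), about 0.368
c(oob_share = oob_share, theory = theory)

# OOB error from the confusion matrix above: misclassified / total
oob_error <- (167 + 0) / (3023 + 167 + 0 + 3190)
round(100 * oob_error, 2)
```

Roughly a third of the rows are left out of each tree, which is why every row ends up with OOB predictions from many of the 500 trees.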
Although random forest is regarded as a hard-to-interpret model, we can at least see which predictors are most important in building the forest:
varImp(bank_forest_f) %>% plot()
From the plot above, we can conclude that the duration predictor has the greatest influence.
bank_pred_rf <- predict(bank_forest_f, bank_test)
plot(bank_pred_rf)
(conf_matrix_bank_rfor <- table(bank_pred_rf, bank_test$y))
#>
#> bank_pred_rf no yes
#> no 766 51
#> yes 44 44
con_bank_rf <- confusionMatrix(conf_matrix_bank_rfor, positive = "yes")
con_bank_rf
#> Confusion Matrix and Statistics
#>
#>
#> bank_pred_rf no yes
#> no 766 51
#> yes 44 44
#>
#> Accuracy : 0.895
#> 95% CI : (0.8732, 0.9142)
#> No Information Rate : 0.895
#> P-Value [Acc > NIR] : 0.5273
#>
#> Kappa : 0.4226
#>
#> Mcnemar's Test P-Value : 0.5382
#>
#> Sensitivity : 0.46316
#> Specificity : 0.94568
#> Pos Pred Value : 0.50000
#> Neg Pred Value : 0.93758
#> Prevalence : 0.10497
#> Detection Rate : 0.04862
#> Detection Prevalence : 0.09724
#> Balanced Accuracy : 0.70442
#>
#> 'Positive' Class : yes
#>
# tibble() replaces the deprecated data_frame(); byClass[3] is Pos Pred Value (precision)
eval_bank_naiv <- tibble(Accuracy = con_bank_naive$overall[1],
Recall = con_bank_naive$byClass[1],
Specificity = con_bank_naive$byClass[2],
Precision = con_bank_naive$byClass[3])
eval_bank_tree <- tibble(Accuracy = tree_con_test$overall[1],
Recall = tree_con_test$byClass[1],
Specificity = tree_con_test$byClass[2],
Precision = tree_con_test$byClass[3])
eval_bank_rf <- tibble(Accuracy = con_bank_rf$overall[1],
Recall = con_bank_rf$byClass[1],
Specificity = con_bank_rf$byClass[2],
Precision = con_bank_rf$byClass[3])
eval_bank_naiv
eval_bank_tree
eval_bank_rf
Of the 3 methods above, each has its own advantages depending on the metric considered.
Our positive class is yes, which means the customer has subscribed to a time deposit, while the negative class is no, which means the customer has not subscribed to a time deposit.
FP: predicting that a customer subscribes to a time deposit (yes) when the customer does not; the bank’s risk is that it incurs a loss.
FN: predicting that a customer does not subscribe to a time deposit (no) when the customer does; the bank’s risk is losing profit.
From the bank’s side, the more concerning risk is FN, so the metric we use is Recall. Of the three machine learning methods above, judged by this metric, we will use the decision tree, which has a Recall of 87.4% on the test data, compared with 73.7% for Naive Bayes and 46.3% for random forest.
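The comparison can be reproduced from the test-set confusion matrices alone. The sketch below copies the four cell counts of each model from the outputs above and ranks the models by recall; the data frame and its column names are chosen here for illustration.

```r
# Test-set counts copied from the three confusion matrices above
models <- data.frame(
  model = c("naive_bayes", "decision_tree", "random_forest"),
  tp = c(70, 83, 44),
  fn = c(25, 12, 51),
  fp = c(185, 195, 44),
  tn = c(625, 615, 766)
)
models$recall    <- models$tp / (models$tp + models$fn)
models$precision <- models$tp / (models$tp + models$fp)

# Rank by recall, the metric that penalises false negatives
ranked <- models[order(-models$recall), c("model", "recall", "precision", "fn")]
ranked
```

The decision tree misses the fewest subscribers (12 false negatives), while the random forest has the highest precision but misses more than half of them.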