In this analysis, we will predict whether potential customers of a bank in Portugal subscribe to a term deposit (“yes”) or not (“no”). To make this prediction, we will build and compare three models: Naive Bayes, Decision Tree, and Random Forest.
The libraries we will use for this analysis are:
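(The exact list did not render in the source; inferred from the functions used later in the analysis, a plausible set is the following sketch, not the author's exact list.)

library(dplyr)        # data wrangling: %>%, select(), mutate()
library(e1071)        # naiveBayes()
library(partykit)     # ctree()
library(randomForest) # randomForest()
library(caret)        # downSample(), confusionMatrix()

The structure output below suggests the data was then read and inspected with str(). A sketch, assuming the UCI Bank Marketing file bank-full.csv with its semicolon separator (the file name is an assumption):

# read the bank marketing data; keep strings as character for now
bank <- read.csv("bank-full.csv", sep = ";", stringsAsFactors = FALSE)
str(bank)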
#> 'data.frame': 45211 obs. of 17 variables:
#> $ age : int 58 44 33 47 33 35 28 42 58 43 ...
#> $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
#> $ marital : chr "married" "single" "married" "married" ...
#> $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
#> $ default : chr "no" "no" "no" "no" ...
#> $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
#> $ housing : chr "yes" "yes" "yes" "yes" ...
#> $ loan : chr "no" "no" "yes" "no" ...
#> $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
#> $ day : int 5 5 5 5 5 5 5 5 5 5 ...
#> $ month : chr "may" "may" "may" "may" ...
#> $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
#> $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#> $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
#> $ y : chr "no" "no" "no" "no" ...
Data Description

The data consists of 45,211 rows and 17 columns:

- age : Age of the client
- job : Type of job
- marital : Marital status
- education : Level of education
- default : Whether the client has credit in default (“yes”, “no”)
- balance : Average yearly balance, in euros
- housing : Whether the client has a housing loan (“yes”, “no”)
- loan : Whether the client has a personal loan (“yes”, “no”)
- contact : Contact communication type
- day : Last contact day of the month
- month : Last contact month
- duration : Last contact duration, in seconds
- campaign : Number of contacts performed during this campaign
- pdays : Number of days since the client was last contacted (-1 means never contacted)
- previous : Number of contacts performed before this campaign
- poutcome : Outcome of the previous marketing campaign
- y : Whether the client subscribed to a term deposit (“yes”, “no”)
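The second str() output below shows the character columns converted to factors. The conversion code is not shown in the source; a minimal dplyr sketch would be:

# convert all character columns to factors (sketch; exact code not shown)
bank <- bank %>%
  mutate(across(where(is.character), as.factor))
str(bank)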
#> 'data.frame': 45211 obs. of 17 variables:
#> $ age : int 58 44 33 47 33 35 28 42 58 43 ...
#> $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
#> $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
#> $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
#> $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
#> $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
#> $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
#> $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
#> $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ day : int 5 5 5 5 5 5 5 5 5 5 ...
#> $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
#> $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
#> $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#> $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
#> $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
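Next, we check each column for missing values. The zero counts below are consistent with a check along these lines (sketch; the exact call is not shown in the source):

# count missing values per column
colSums(is.na(bank))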
#> age job marital education default balance housing loan
#> 0 0 0 0 0 0 0 0
#> contact day month duration campaign pdays previous poutcome
#> 0 0 0 0 0 0 0 0
#> y
#> 0
The output above shows that there are no missing values, so we can proceed to the next step: building the models.
Before building the model, we first split the data into 80% training data and 20% testing data.
RNGkind(sample.kind = "Rounding")
set.seed(123)

# split the data: 80% training, 20% testing
row_data <- nrow(bank)
index <- sample(row_data, row_data * 0.8)

data_train <- bank[index, ]
data_test  <- bank[-index, ]
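We then check the class proportions of the target variable in the training data; the output below is consistent with something like:

# proportion of each target class in the training set (sketch)
prop.table(table(data_train$y))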
#> no yes
#> 0.8830152 0.1169848
The class proportions of the target variable are imbalanced: about 88% “no” versus 12% “yes”. Therefore, we will take the following step:
set.seed(123)

# balance the classes by downsampling the majority class
data_train_downsample <- downSample(x = data_train %>% select(-y),
                                    y = data_train$y,
                                    list = FALSE,
                                    yname = "y")

Downsampling reduces the majority class until its count matches the minority class, producing a balanced class proportion. In our training data, the “no” class is randomly sampled down until it equals the number of “yes” instances.
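Re-checking the class proportions after downsampling (sketch of the presumed call):

prop.table(table(data_train_downsample$y))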
#>
#> no yes
#> 0.5 0.5
The target variable class proportion is now balanced. Therefore, we proceed to the next step: model building.
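The code for the first model, Naive Bayes, is not shown in the source. A minimal sketch, assuming naiveBayes() from e1071 and confusionMatrix() from caret; the names pred_nb and conf_nb are chosen to match the evaluation code at the end of the document:

# Naive Bayes (sketch): train on the downsampled data, evaluate on the test set
model_nb <- naiveBayes(y ~ ., data = data_train_downsample)
pred_nb  <- predict(model_nb, newdata = data_test)
conf_nb  <- confusionMatrix(data = pred_nb, reference = data_test$y, positive = "yes")
conf_nb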
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 6395 284
#> yes 1602 762
#>
#> Accuracy : 0.7914
#> 95% CI : (0.7829, 0.7998)
#> No Information Rate : 0.8843
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.3413
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.72849
#> Specificity : 0.79967
#> Pos Pred Value : 0.32234
#> Neg Pred Value : 0.95748
#> Prevalence : 0.11567
#> Detection Rate : 0.08426
#> Detection Prevalence : 0.26142
#> Balanced Accuracy : 0.76408
#>
#> 'Positive' Class : yes
#>
model_dtree <- ctree(formula = y ~ ., data = data_train_downsample)
plot(model_dtree, type = "simple")
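The tree above was fit with ctree() from partykit. The code that produced the confusion matrix below is not shown; a minimal sketch, following the same evaluation pattern (pred_dtree and conf_dtree are assumed names):

# Decision Tree (sketch): predict on the test set and evaluate
pred_dtree <- predict(model_dtree, newdata = data_test)
conf_dtree <- confusionMatrix(data = pred_dtree, reference = data_test$y, positive = "yes")
conf_dtree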
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 6472 141
#> yes 1525 905
#>
#> Accuracy : 0.8158
#> 95% CI : (0.8076, 0.8237)
#> No Information Rate : 0.8843
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.4282
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.8652
#> Specificity : 0.8093
#> Pos Pred Value : 0.3724
#> Neg Pred Value : 0.9787
#> Prevalence : 0.1157
#> Detection Rate : 0.1001
#> Detection Prevalence : 0.2687
#> Balanced Accuracy : 0.8373
#>
#> 'Positive' Class : yes
#>
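The Random Forest results below are likewise shown without code. A minimal sketch, assuming randomForest() from the randomForest package and the same evaluation pattern (model_rf, pred_rf, and conf_rf are assumed names):

# Random Forest (sketch): train on the downsampled data, evaluate on the test set
set.seed(123)
model_rf <- randomForest(y ~ ., data = data_train_downsample)
pred_rf  <- predict(model_rf, newdata = data_test)
conf_rf  <- confusionMatrix(data = pred_rf, reference = data_test$y, positive = "yes")
conf_rf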
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 6676 118
#> yes 1321 928
#>
#> Accuracy : 0.8409
#> 95% CI : (0.8332, 0.8484)
#> No Information Rate : 0.8843
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.4814
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.8872
#> Specificity : 0.8348
#> Pos Pred Value : 0.4126
#> Neg Pred Value : 0.9826
#> Prevalence : 0.1157
#> Detection Rate : 0.1026
#> Detection Prevalence : 0.2487
#> Balanced Accuracy : 0.8610
#>
#> 'Positive' Class : yes
#>
# summarize the Naive Bayes metrics in one row
eval_nb <- data_frame(Accuracy = conf_nb$overall[1],
                      Recall = conf_nb$byClass[1],
                      Specificity = conf_nb$byClass[2],
                      Precision = conf_nb$byClass[3])
eval_nb
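Only the Naive Bayes summary is built above, while the conclusion compares all three models, so matching tables were presumably built for the other two. A sketch, assuming the conf_dtree and conf_rf objects from the earlier sketches (note that data_frame() is superseded by tibble() in current dplyr):

eval_dtree <- data_frame(Accuracy = conf_dtree$overall[1],
                         Recall = conf_dtree$byClass[1],
                         Specificity = conf_dtree$byClass[2],
                         Precision = conf_dtree$byClass[3])

eval_rf <- data_frame(Accuracy = conf_rf$overall[1],
                      Recall = conf_rf$byClass[1],
                      Specificity = conf_rf$byClass[2],
                      Precision = conf_rf$byClass[3])

# combine the three summaries into one comparison table
bind_rows("Naive Bayes" = eval_nb,
          "Decision Tree" = eval_dtree,
          "Random Forest" = eval_rf,
          .id = "Model")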
Based on the results of the three models (Naive Bayes, Decision Tree, and Random Forest), the Random Forest model performs best. This is evident from its higher Accuracy, Recall, Specificity, and Precision compared to Naive Bayes and Decision Tree.
However, the downside of the Random Forest model is that it takes longer to train and is harder to interpret than simpler models such as Decision Tree and Naive Bayes.