The objective of this analysis is to predict which clients are willing to subscribe to a term deposit product offered by a Portuguese banking institution through its telemarketing channel during 2008-2010. The models built here use the Naive Bayes, Decision Tree, and Random Forest methods.
Source of data: the Bank Marketing Data Set, which can be accessed at https://archive.ics.uci.edu/ml/datasets/bank+marketing
Here are the libraries used in this analysis:
library(dplyr)
library(ggplot2)
library(gridExtra)
library(caret)
library(e1071)
library(ROCR)
library(partykit)
library(randomForest)
library(rattle)
library(GGally)
library(car)
library(ROSE)
library(rpart)
Here is the data used.
bank <- read.csv("bank-full.csv", sep=";")
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, ~
## $ job <chr> "management", "technician", "entrepreneur", "blue-collar", "~
## $ marital <chr> "married", "single", "married", "married", "single", "marrie~
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", ~
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",~
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71~
## $ housing <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"~
## $ loan <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"~
## $ contact <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn~
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ~
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may~
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,~
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ~
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn~
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
Column Description: a full description of each column is available at the data source linked above.
Target Variable: y, whether the client has subscribed to a term deposit ("yes"/"no").
The following adjustments will be made to the dataset:
bank <- bank %>% mutate(job = as.factor(job),
marital = as.factor(marital),
education = as.factor(education),
default = as.factor(default),
loan = as.factor(loan),
housing = as.factor(housing),
contact = as.factor(contact),
month = as.factor(month),
poutcome = as.factor(poutcome),
y = as.factor(y))
head(bank)
colSums(is.na(bank))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
The check above shows that the dataset contains no missing values.
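Several factor columns also use "unknown" as a level, which effectively acts as a missing-value marker. As a supplementary check (a sketch, not part of the original workflow), the line below counts those values per factor column:
# Count "unknown" entries in each factor column (e.g. education, contact, poutcome).
sapply(Filter(is.factor, bank), function(x) sum(x == "unknown"))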
To build a Naive Bayes model, it is important to check the correlation among the predictors, since the method assumes that each predictor is independent of the others.
ggcorr(bank, label_size = 3, hjust = 1, layout.exp = 2, label = T)
Based on the plot above, the predictors in the dataset are largely uncorrelated, with the main exception being a moderate correlation between pdays and previous.
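As a cross-check (an addition, not in the original code), the same information can be read from a plain correlation matrix; note that ggcorr() silently drops the factor columns:
# Correlation matrix of the numeric predictors only.
bank %>%
  select(where(is.numeric)) %>%
  cor() %>%
  round(2)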
Before fitting a model, the dataset needs to be split into "train" and "test" sets, with 80% of the data assigned to "train" and 20% to "test". Here is the splitting process.
RNGkind(sample.kind = "Rounding")
set.seed(192)
index <- sample(nrow(bank), size = nrow(bank)*0.8)
bank.train <- bank[index, ]
bank.test <- bank[-index, ]
Before building a model, it is necessary to check the proportion of the target variable.
prop.table(table(bank.train$y))
##
## no yes
## 0.8828246 0.1171754
The proportion check reveals a class imbalance of roughly 88:12. To address it, the majority class will be downsampled (undersampled).
set.seed(192)
train.down <- ovun.sample(y~.,
data = bank.train,
method = "under")$dataprop.table(table(train.down$y))##
## no yes
## 0.5012944 0.4987056
After downsampling, the updated proportion is balanced at roughly 0.501 : 0.499, so the modelling process can continue.
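For reference, caret (already loaded) offers an equivalent route; the sketch below is an assumption about an alternative, not part of the original workflow, and train.down2 is a hypothetical name:
# Undersample the majority class to a 50:50 split; column 17 is the target y.
set.seed(192)
train.down2 <- downSample(x = bank.train[, -17], y = bank.train$y, yname = "y")
prop.table(table(train.down2$y))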
Here is the Naive Bayes modelling process. Laplace smoothing with laplace = 1 is applied so that no conditional probability becomes zero for a category level that happens to be unobserved within a class.
model.nb <- naiveBayes(y~., train.down, laplace = 1)
model.nb
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## no yes
## 0.5012944 0.4987056
##
## Conditional probabilities:
## age
## Y [,1] [,2]
## no 40.90423 10.28181
## yes 41.68240 13.54187
##
## job
## Y admin. blue-collar entrepreneur housemaid management retired
## no 0.108380150 0.224719101 0.032069288 0.031132959 0.200374532 0.046114232
## yes 0.120470588 0.136705882 0.021647059 0.019764706 0.240941176 0.098117647
## job
## Y self-employed services student technician unemployed unknown
## no 0.039559925 0.095739700 0.018960674 0.166198502 0.027153558 0.009597378
## yes 0.036235294 0.071529412 0.051529412 0.156705882 0.039058824 0.007294118
##
## marital
## Y divorced married single
## no 0.1168191 0.6040347 0.2791461
## yes 0.1211978 0.5147371 0.3640651
##
## education
## Y primary secondary tertiary unknown
## no 0.15173546 0.52884615 0.27954972 0.03986867
## yes 0.10843942 0.46605375 0.37670910 0.04879774
##
## default
## Y no yes
## no 0.98122947 0.01877053
## yes 0.98962264 0.01037736
##
## balance
## Y [,1] [,2]
## no 1307.341 3283.647
## yes 1768.836 3333.038
##
## housing
## Y no yes
## no 0.4246832 0.5753168
## yes 0.6379717 0.3620283
##
## loan
## Y no yes
## no 0.83317691 0.16682309
## yes 0.90872642 0.09127358
##
## contact
## Y cellular telephone unknown
## no 0.62631949 0.06873094 0.30494957
## yes 0.82739920 0.07120962 0.10139118
##
## day
## Y [,1] [,2]
## no 15.93685 8.252132
## yes 15.14370 8.473055
##
## month
## Y apr aug dec feb jan jul
## no 0.066713483 0.143492509 0.002574906 0.056179775 0.031132959 0.154026217
## yes 0.104941176 0.131764706 0.021176471 0.084235294 0.026823529 0.120470588
## month
## Y jun mar may nov oct sep
## no 0.109784644 0.006320225 0.325842697 0.085205993 0.011001873 0.007724719
## yes 0.103764706 0.047058824 0.172705882 0.073882353 0.060705882 0.052470588
##
## duration
## Y [,1] [,2]
## no 222.4960 209.0754
## yes 540.0144 392.5904
##
## campaign
## Y [,1] [,2]
## no 2.791549 2.995923
## yes 2.120104 1.799999
##
## pdays
## Y [,1] [,2]
## no 35.60141 95.41756
## yes 68.09863 119.41539
##
## previous
## Y [,1] [,2]
## no 0.4467136 1.496878
## yes 1.1772062 2.643383
##
## poutcome
## Y failure other success unknown
## no 0.10647280 0.03799250 0.01383677 0.84169794
## yes 0.11621876 0.05964168 0.18057520 0.64356436
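To see what laplace = 1 does, here is a toy illustration (an aside, not part of the model above). For a categorical predictor, the smoothed estimate is (count + laplace) / (n_class + laplace * n_levels), so a level with a zero count still receives a small nonzero probability:
# Hypothetical counts for one predictor within one class; "other" was never observed.
counts <- c(failure = 10, other = 0, success = 5, unknown = 85)
(counts + 1) / (sum(counts) + 1 * length(counts))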
Here is the prediction on the test data using the Naive Bayes model.
pred.nb <- predict(object = model.nb,
newdata = bank.test,
type = "class")To evaluate the model performance will be used confusion matrix, as it is proceeded below.
confusionMatrix(data = pred.nb,
reference = bank.test$y,
positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6561 292
## yes 1431 759
##
## Accuracy : 0.8095
## 95% CI : (0.8012, 0.8175)
## No Information Rate : 0.8838
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3693
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.72217
## Specificity : 0.82095
## Pos Pred Value : 0.34658
## Neg Pred Value : 0.95739
## Prevalence : 0.11622
## Detection Rate : 0.08393
## Detection Prevalence : 0.24218
## Balanced Accuracy : 0.77156
##
## 'Positive' Class : yes
##
The confusionMatrix above shows an accuracy of 80.95%, a sensitivity (recall) of 72.22%, and a positive predictive value (precision) of 34.66%.
Since the objective of this analysis is to identify customers willing to subscribe to a deposit ("yes"), recall (sensitivity) is used as the main evaluation metric.
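Since recall is the target metric, one possible follow-up (an assumption, not part of the original workflow) is to lower the decision threshold on the raw posterior probabilities below the default 0.5, trading precision for recall; pred.nb.040 is a hypothetical name:
# Classify as "yes" whenever the posterior probability for "yes" exceeds 0.4.
pred.nb.raw <- predict(model.nb, newdata = bank.test, type = "raw")
pred.nb.040 <- factor(ifelse(pred.nb.raw[, "yes"] > 0.4, "yes", "no"),
                      levels = levels(bank.test$y))
confusionMatrix(data = pred.nb.040, reference = bank.test$y, positive = "yes")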
Besides the confusion matrix, the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are used to measure how well the model separates the positive and negative classes, as shown below.
pred.nb.prb <- predict(model.nb, newdata = bank.test, type = "raw")
roc.bank.nb <- data.frame(pred.prop = pred.nb.prb [,2],
label = as.numeric(bank.test$y == "yes"))
pred.roc.nb <- prediction(predictions = roc.bank.nb$pred.prop,
labels = roc.bank.nb$label)
plot(performance(pred.roc.nb, "tpr", "fpr"))
auc.bank.nb <- performance(pred.roc.nb, "auc")
auc.bank.nb@y.values
## [[1]]
## [1] 0.8504087
The AUC of 0.85 (out of 1.00) indicates that the model separates the positive and negative classes fairly well.
Here is the decision tree process.
model.dt <- ctree(y~., data = train.down)
Here are the predictions on the test data and the training data.
#data test
pred.dt.test <- predict(object = model.dt,
newdata = bank.test)
#data train
pred.dt.train <- predict(object = model.dt,
newdata = bank.train)
To evaluate the model's performance, confusion matrices for both sets are used, as shown below.
#data test
confusionMatrix(data = pred.dt.test,
reference = bank.test$y,
positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6577 151
## yes 1415 900
##
## Accuracy : 0.8268
## 95% CI : (0.8189, 0.8346)
## No Information Rate : 0.8838
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4462
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.85633
## Specificity : 0.82295
## Pos Pred Value : 0.38877
## Neg Pred Value : 0.97756
## Prevalence : 0.11622
## Detection Rate : 0.09952
## Detection Prevalence : 0.25600
## Balanced Accuracy : 0.83964
##
## 'Positive' Class : yes
##
The confusionMatrix on the test data above shows an accuracy of 82.68% and a sensitivity of 85.63%. Here is the same check on the training data.
#data train
confusionMatrix(data = pred.dt.train,
reference = bank.train$y,
positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 26243 508
## yes 5687 3730
##
## Accuracy : 0.8287
## 95% CI : (0.8248, 0.8326)
## No Information Rate : 0.8828
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4589
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8801
## Specificity : 0.8219
## Pos Pred Value : 0.3961
## Neg Pred Value : 0.9810
## Prevalence : 0.1172
## Detection Rate : 0.1031
## Detection Prevalence : 0.2604
## Balanced Accuracy : 0.8510
##
## 'Positive' Class : yes
##
The confusionMatrix on the training data above shows an accuracy of 82.87% and a sensitivity of 88.01%.
Since the objective is to identify customers willing to subscribe ("yes"), recall (sensitivity) remains the main evaluation metric.
With test and train accuracies of 82.68% and 82.87%, and sensitivities of 85.63% and 88.01% respectively, the two sets perform similarly, so the model does not appear to be over-fitted.
As before, the ROC curve and AUC are used to measure how well the model separates the positive and negative classes.
pred.dt.prb <- predict(model.dt, newdata = bank.test, type = "prob")
roc.bank.dt <- data.frame(pred.prop = pred.dt.prb [,2],
label = as.numeric(bank.test$y == "yes"))
pred.roc.dt <- prediction(predictions = roc.bank.dt$pred.prop,
labels = roc.bank.dt$label)
plot(performance(pred.roc.dt, "tpr", "fpr"))
auc.bank.dt <- performance(pred.roc.dt, "auc")
auc.bank.dt@y.values
## [[1]]
## [1] 0.9057526
The AUC of 0.91 (out of 1.00) indicates that the model separates the positive and negative classes well.
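The fitted tree itself can also be inspected visually through partykit's plot method (a sketch; output not shown, and on this data the full tree is large):
# Draw the conditional inference tree with compact node labels.
plot(model.dt, type = "simple")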
Here is the random forest process.
To reduce processing time, predictors whose variance is close to zero (and which therefore carry little information) are removed first.
nz.var.train <- nearZeroVar(train.down)
nz.var.test <- nearZeroVar(bank.test)
no.var.train <- train.down[, -nz.var.train]
no.var.test <- bank.test[, -nz.var.test]
glimpse(no.var.train)
## Rows: 8,498
## Columns: 15
## $ age <int> 56, 48, 55, 29, 45, 60, 60, 56, 37, 34, 35, 51, 63, 33, 33, ~
## $ job <fct> technician, management, retired, blue-collar, technician, en~
## $ marital <fct> divorced, divorced, married, married, married, married, divo~
## $ education <fct> unknown, tertiary, secondary, primary, secondary, primary, s~
## $ balance <int> 56, 2939, 1126, 50, 268, 2408, 162, 9, 1624, 416, 451, 303, ~
## $ housing <fct> yes, no, yes, no, no, no, no, no, yes, no, yes, yes, no, yes~
## $ loan <fct> no, no, no, no, no, no, no, yes, no, no, yes, no, no, yes, n~
## $ contact <fct> unknown, cellular, telephone, cellular, cellular, telephone,~
## $ day <int> 5, 22, 28, 8, 7, 20, 27, 20, 20, 14, 12, 3, 16, 9, 2, 30, 11~
## $ month <fct> may, aug, jul, jul, jul, apr, may, aug, nov, aug, may, jun, ~
## $ duration <int> 439, 91, 388, 185, 96, 102, 74, 285, 559, 1217, 247, 87, 57,~
## $ campaign <int> 1, 4, 2, 2, 14, 4, 1, 2, 2, 4, 1, 7, 4, 3, 1, 3, 3, 1, 22, 4~
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, ~
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow~
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, ~
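One caveat (an observation on the code above, not part of the original workflow): running nearZeroVar() separately on the training and test sets can flag different columns, leaving the two sets misaligned. A safer variant reuses the training-set indices for both:
# Drop the same near-zero-variance columns from both sets, based on training data only.
no.var.test <- bank.test[, -nz.var.train]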
Here is the model building, using trainControl to tune the number of randomly selected predictors (mtry) with 5-fold cross-validation repeated twice.
control <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
model.rf <- train(y~., data = no.var.train, method = "rf", trControl = control)
saveRDS(model.rf, "bank_RF.RDS")
bank.model.rf <- readRDS("bank_RF.RDS")
glimpse(bank.model.rf)
## List of 25
## $ method : chr "rf"
## $ modelInfo :List of 15
## ..$ label : chr "Random Forest"
## ..$ library : chr "randomForest"
## ..$ loop : NULL
## ..$ type : chr [1:2] "Classification" "Regression"
## ..$ parameters:'data.frame': 1 obs. of 3 variables:
## .. ..$ parameter: chr "mtry"
## .. ..$ class : chr "numeric"
## .. ..$ label : chr "#Randomly Selected Predictors"
## ..$ grid :function (x, y, len = NULL, search = "grid")
## ..$ fit :function (x, y, wts, param, lev, last, classProbs, ...)
## ..$ predict :function (modelFit, newdata, submodels = NULL)
## ..$ prob :function (modelFit, newdata, submodels = NULL)
## ..$ predictors:function (x, ...)
## ..$ varImp :function (object, ...)
## ..$ levels :function (x)
## ..$ tags : chr [1:4] "Random Forest" "Ensemble Model" "Bagging" "Implicit Feature Selection"
## ..$ sort :function (x)
## ..$ oob :function (x)
## $ modelType : chr "Classification"
## $ results :'data.frame': 3 obs. of 5 variables:
## ..$ mtry : num [1:3] 2 21 40
## ..$ Accuracy : num [1:3] 0.803 0.852 0.851
## ..$ Kappa : num [1:3] 0.605 0.705 0.701
## ..$ AccuracySD: num [1:3] 0.01232 0.00818 0.00804
## ..$ KappaSD : num [1:3] 0.0247 0.0163 0.0161
## $ pred : NULL
## $ bestTune :'data.frame': 1 obs. of 1 variable:
## ..$ mtry: num 21
## $ call : language train.formula(form = y ~ ., data = no.var.train, method = "rf", trControl = control)
## $ dots : list()
## $ metric : chr "Accuracy"
## $ control :List of 27
## ..$ method : chr "repeatedcv"
## ..$ number : num 5
## ..$ repeats : num 2
## ..$ search : chr "grid"
## ..$ p : num 0.75
## ..$ initialWindow : NULL
## ..$ horizon : num 1
## ..$ fixedWindow : logi TRUE
## ..$ skip : num 0
## ..$ verboseIter : logi FALSE
## ..$ returnData : logi TRUE
## ..$ returnResamp : chr "final"
## ..$ savePredictions : chr "none"
## ..$ classProbs : logi FALSE
## ..$ summaryFunction :function (data, lev = NULL, model = NULL)
## ..$ selectionFunction: chr "best"
## ..$ preProcOptions :List of 6
## .. ..$ thresh : num 0.95
## .. ..$ ICAcomp : num 3
## .. ..$ k : num 5
## .. ..$ freqCut : num 19
## .. ..$ uniqueCut: num 10
## .. ..$ cutoff : num 0.9
## ..$ sampling : NULL
## ..$ index :List of 10
## .. ..$ Fold1.Rep1: int [1:6799] 1 2 3 5 6 7 8 9 10 11 ...
## .. ..$ Fold2.Rep1: int [1:6798] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold3.Rep1: int [1:6799] 2 4 5 6 8 10 11 12 13 15 ...
## .. ..$ Fold4.Rep1: int [1:6798] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold5.Rep1: int [1:6798] 1 3 4 7 9 12 13 14 15 16 ...
## .. ..$ Fold1.Rep2: int [1:6798] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold2.Rep2: int [1:6798] 1 5 6 7 8 9 10 11 13 14 ...
## .. ..$ Fold3.Rep2: int [1:6799] 2 3 4 6 7 8 10 11 12 13 ...
## .. ..$ Fold4.Rep2: int [1:6798] 1 2 3 4 5 7 8 9 12 13 ...
## .. ..$ Fold5.Rep2: int [1:6799] 1 2 3 4 5 6 9 10 11 12 ...
## ..$ indexOut :List of 10
## .. ..$ Resample01: int [1:1699] 4 15 19 21 26 27 33 34 45 49 ...
## .. ..$ Resample02: int [1:1700] 20 22 25 32 35 41 46 47 63 66 ...
## .. ..$ Resample03: int [1:1699] 1 3 7 9 14 17 29 31 36 38 ...
## .. ..$ Resample04: int [1:1700] 12 13 16 18 23 24 28 39 43 53 ...
## .. ..$ Resample05: int [1:1700] 2 5 6 8 10 11 30 37 40 42 ...
## .. ..$ Resample06: int [1:1700] 13 14 16 22 33 36 37 39 41 42 ...
## .. ..$ Resample07: int [1:1700] 2 3 4 12 18 19 25 40 53 58 ...
## .. ..$ Resample08: int [1:1699] 1 5 9 20 27 43 46 49 51 55 ...
## .. ..$ Resample09: int [1:1700] 6 10 11 17 21 26 30 32 34 44 ...
## .. ..$ Resample10: int [1:1699] 7 8 15 23 24 28 29 31 35 38 ...
## ..$ indexFinal : NULL
## ..$ timingSamps : num 0
## ..$ predictionBounds : logi [1:2] FALSE FALSE
## ..$ seeds :List of 11
## .. ..$ : int [1:3] 945873 423799 898085
## .. ..$ : int [1:3] 250379 364676 54093
## .. ..$ : int [1:3] 170850 365471 474005
## .. ..$ : int [1:3] 665620 765413 363035
## .. ..$ : int [1:3] 375028 602099 625898
## .. ..$ : int [1:3] 475448 545467 355686
## .. ..$ : int [1:3] 267900 211683 496212
## .. ..$ : int [1:3] 559824 414995 506401
## .. ..$ : int [1:3] 392221 911793 800844
## .. ..$ : int [1:3] 78949 693462 636605
## .. ..$ : int 315343
## ..$ adaptive :List of 4
## .. ..$ min : num 5
## .. ..$ alpha : num 0.05
## .. ..$ method : chr "gls"
## .. ..$ complete: logi TRUE
## ..$ trim : logi FALSE
## ..$ allowParallel : logi TRUE
## $ finalModel :List of 23
## ..$ call : language randomForest(x = x, y = y, mtry = param$mtry)
## ..$ type : chr "classification"
## ..$ predicted : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 2 2 ...
## .. ..- attr(*, "names")= chr [1:8498] "X1" "X2" "X3" "X4" ...
## ..$ err.rate : num [1:500, 1:3] 0.223 0.232 0.224 0.221 0.218 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ confusion : num [1:2, 1:3] 3489 453 771 3785 0.181 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ votes : 'matrix' num [1:8498, 1:2] 0.726 0.961 0.452 0.915 0.949 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ oob.times : num [1:8498] 186 180 186 188 175 169 180 182 184 186 ...
## ..$ classes : chr [1:2] "no" "yes"
## ..$ importance : num [1:40, 1] 309.46 35.95 9.14 9.7 29.87 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ importanceSD : NULL
## ..$ localImportance: NULL
## ..$ proximity : NULL
## ..$ ntree : num 500
## ..$ mtry : num 21
## ..$ forest :List of 14
## .. ..$ ndbigtree : int [1:500] 1915 1945 1863 1857 1885 1785 1903 1967 1909 1897 ...
## .. ..$ nodestatus: int [1:2017, 1:500] 1 1 1 1 1 1 1 1 1 -1 ...
## .. ..$ bestvar : int [1:2017, 1:500] 35 30 35 35 1 39 22 1 40 0 ...
## .. ..$ treemap : int [1:2017, 1:2, 1:500] 2 4 6 8 10 12 14 16 18 0 ...
## .. ..$ nodepred : int [1:2017, 1:500] 0 0 0 0 0 0 0 0 0 1 ...
## .. ..$ xbestsplit: num [1:2017, 1:500] 207.5 0.5 448.5 124.5 26 ...
## .. ..$ pid : num [1:2] 1 1
## .. ..$ cutoff : num [1:2] 0.5 0.5
## .. ..$ ncat : Named num [1:40] 1 1 1 1 1 1 1 1 1 1 ...
## .. .. ..- attr(*, "names")= chr [1:40] "age" "jobblue-collar" "jobentrepreneur" "jobhousemaid" ...
## .. ..$ maxcat : num 1
## .. ..$ nrnodes : int 2017
## .. ..$ ntree : num 500
## .. ..$ nclass : int 2
## .. ..$ xlevels :List of 40
## ..$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:8498] "1" "2" "3" "4" ...
## ..$ test : NULL
## ..$ inbag : NULL
## ..$ xNames : chr [1:40] "age" "jobblue-collar" "jobentrepreneur" "jobhousemaid" ...
## ..$ problemType : chr "Classification"
## ..$ tuneValue :'data.frame': 1 obs. of 1 variable:
## .. ..$ mtry: num 21
## ..$ obsLevels : chr [1:2] "no" "yes"
## .. ..- attr(*, "ordered")= logi FALSE
## ..$ param : list()
## ..- attr(*, "class")= chr "randomForest"
## $ preProcess : NULL
## $ trainingData:'data.frame': 8498 obs. of 15 variables:
## ..$ .outcome : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ age : int [1:8498] 56 48 55 29 45 60 60 56 37 34 ...
## ..$ job : Factor w/ 12 levels "admin.","blue-collar",..: 10 5 6 2 10 3 1 8 7 5 ...
## ..$ marital : Factor w/ 3 levels "divorced","married",..: 1 1 2 2 2 2 1 2 2 3 ...
## ..$ education: Factor w/ 4 levels "primary","secondary",..: 4 3 2 1 2 1 2 2 3 3 ...
## ..$ balance : int [1:8498] 56 2939 1126 50 268 2408 162 9 1624 416 ...
## ..$ housing : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 1 2 1 ...
## ..$ loan : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## ..$ contact : Factor w/ 3 levels "cellular","telephone",..: 3 1 2 1 1 2 3 1 1 1 ...
## ..$ day : int [1:8498] 5 22 28 8 7 20 27 20 20 14 ...
## ..$ month : Factor w/ 12 levels "apr","aug","dec",..: 9 2 6 6 6 1 9 2 10 2 ...
## ..$ duration : int [1:8498] 439 91 388 185 96 102 74 285 559 1217 ...
## ..$ campaign : int [1:8498] 1 4 2 2 14 4 1 2 2 4 ...
## ..$ previous : int [1:8498] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ ptype :'data.frame': 0 obs. of 14 variables:
## ..$ age : int(0)
## ..$ job : Factor w/ 12 levels "admin.","blue-collar",..:
## ..$ marital : Factor w/ 3 levels "divorced","married",..:
## ..$ education: Factor w/ 4 levels "primary","secondary",..:
## ..$ balance : int(0)
## ..$ housing : Factor w/ 2 levels "no","yes":
## ..$ loan : Factor w/ 2 levels "no","yes":
## ..$ contact : Factor w/ 3 levels "cellular","telephone",..:
## ..$ day : int(0)
## ..$ month : Factor w/ 12 levels "apr","aug","dec",..:
## ..$ duration : int(0)
## ..$ campaign : int(0)
## ..$ previous : int(0)
## ..$ poutcome : Factor w/ 4 levels "failure","other",..:
## $ resample :'data.frame': 10 obs. of 3 variables:
## ..$ Accuracy: num [1:10] 0.861 0.83 0.852 0.857 0.854 ...
## ..$ Kappa : num [1:10] 0.721 0.661 0.705 0.714 0.708 ...
## ..$ Resample: chr [1:10] "Fold2.Rep1" "Fold1.Rep1" "Fold4.Rep1" "Fold3.Rep1" ...
## $ resampledCM :'data.frame': 30 obs. of 6 variables:
## ..$ cell1 : num [1:30] 691 669 672 713 688 686 729 695 698 719 ...
## ..$ cell2 : num [1:30] 161 183 180 139 164 166 123 157 154 133 ...
## ..$ cell3 : num [1:30] 209 105 109 194 73 77 186 86 91 210 ...
## ..$ cell4 : num [1:30] 638 742 738 654 775 771 661 761 756 638 ...
## ..$ mtry : num [1:30] 2 21 40 2 21 40 2 21 40 2 ...
## ..$ Resample: chr [1:30] "Fold1.Rep1" "Fold1.Rep1" "Fold1.Rep1" "Fold2.Rep1" ...
## $ perfNames : chr [1:2] "Accuracy" "Kappa"
## $ maximize : logi TRUE
## $ yLimits : NULL
## $ times :List of 3
## ..$ everything: 'proc_time' Named num [1:5] 451.9 12.5 475.2 NA NA
## .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
## ..$ final : 'proc_time' Named num [1:5] 20.93 0.35 21.93 NA NA
## .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
## ..$ prediction: logi [1:3] NA NA NA
## $ levels : chr [1:2] "no" "yes"
## ..- attr(*, "ordered")= logi FALSE
## $ terms :Classes 'terms', 'formula' language y ~ age + job + marital + education + balance + housing + loan + contact + day + month + duration + campaign| __truncated__
## .. ..- attr(*, "variables")= language list(y, age, job, marital, education, balance, housing, loan, contact, day, month, duration, campaign, previous, poutcome)
## .. ..- attr(*, "factors")= int [1:15, 1:14] 0 1 0 0 0 0 0 0 0 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. ..- attr(*, "term.labels")= chr [1:14] "age" "job" "marital" "education" ...
## .. ..- attr(*, "order")= int [1:14] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(y, age, job, marital, education, balance, housing, loan, contact, day, month, duration, campaign, previous, poutcome)
## .. ..- attr(*, "dataClasses")= Named chr [1:15] "factor" "numeric" "factor" "factor" ...
## .. .. ..- attr(*, "names")= chr [1:15] "y" "age" "job" "marital" ...
## $ coefnames : chr [1:40] "age" "jobblue-collar" "jobentrepreneur" "jobhousemaid" ...
## $ contrasts :List of 8
## ..$ job : chr "contr.treatment"
## ..$ marital : chr "contr.treatment"
## ..$ education: chr "contr.treatment"
## ..$ housing : chr "contr.treatment"
## ..$ loan : chr "contr.treatment"
## ..$ contact : chr "contr.treatment"
## ..$ month : chr "contr.treatment"
## ..$ poutcome : chr "contr.treatment"
## $ xlevels :List of 8
## ..$ job : chr [1:12] "admin." "blue-collar" "entrepreneur" "housemaid" ...
## ..$ marital : chr [1:3] "divorced" "married" "single"
## ..$ education: chr [1:4] "primary" "secondary" "tertiary" "unknown"
## ..$ housing : chr [1:2] "no" "yes"
## ..$ loan : chr [1:2] "no" "yes"
## ..$ contact : chr [1:3] "cellular" "telephone" "unknown"
## ..$ month : chr [1:12] "apr" "aug" "dec" "feb" ...
## ..$ poutcome : chr [1:4] "failure" "other" "success" "unknown"
## - attr(*, "class")= chr [1:2] "train" "train.formula"
#data test
pred.rf.test <- predict(object = bank.model.rf,
newdata = bank.test)
To evaluate the model's performance, the out-of-bag error and a confusion matrix are used, as shown below.
Here is the out-of-bag error check.
bank.model.rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 21
##
## OOB estimate of error rate: 14.4%
## Confusion matrix:
## no yes class.error
## no 3489 771 0.1809859
## yes 453 3785 0.1068900
The out-of-bag estimate of the error rate is 14.4%, which corresponds to roughly 85.6% accuracy on the out-of-bag samples.
Here is the confusion matrix check.
#data test
confusionMatrix(data = pred.rf.test,
reference = bank.test$y,
positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6580 112
## yes 1412 939
##
## Accuracy : 0.8315
## 95% CI : (0.8236, 0.8391)
## No Information Rate : 0.8838
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4663
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8934
## Specificity : 0.8233
## Pos Pred Value : 0.3994
## Neg Pred Value : 0.9833
## Prevalence : 0.1162
## Detection Rate : 0.1038
## Detection Prevalence : 0.2600
## Balanced Accuracy : 0.8584
##
## 'Positive' Class : yes
##
The confusionMatrix on the test data above shows an accuracy of 83.15% and a sensitivity of 89.34%.
Since the objective is to identify customers willing to subscribe ("yes"), recall (sensitivity) remains the main evaluation metric.
The out-of-bag accuracy of roughly 85.6% and the test accuracy of 83.15% are close, so the model does not appear to be over-fitted.
As before, the ROC curve and AUC are used to measure how well the model separates the positive and negative classes.
pred.rf.prb <- predict(bank.model.rf, newdata = bank.test, type = "prob")
roc.bank.rf <- data.frame(pred.prop = pred.rf.prb [,2],
label = as.numeric(bank.test$y == "yes"))
pred.roc.rf <- prediction(predictions = roc.bank.rf$pred.prop,
labels = roc.bank.rf$label)
plot(performance(pred.roc.rf, "tpr", "fpr"))
auc.bank.rf <- performance(pred.roc.rf, "auc")
auc.bank.rf@y.values
## [[1]]
## [1] 0.9254217
The AUC of roughly 0.93 (out of 1.00) indicates that the model separates the positive and negative classes very well.
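As a supplementary check (not part of the original workflow), caret's varImp() shows which predictors drive the random forest (output not shown):
# Rank predictors by importance in the final random forest model.
plot(varImp(bank.model.rf))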
Of the three models above, the Random Forest model delivers the best overall performance compared to the Decision Tree and Naive Bayes models, with an accuracy of 83.15%, a sensitivity of 89.34%, and an AUC of 92.54%.
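For a side-by-side recap, the sketch below collects the figures reported earlier into one table; all values are copied from the confusion matrices and AUC outputs above, not recomputed:
# Summary of test-set performance for the three models.
data.frame(model       = c("Naive Bayes", "Decision Tree", "Random Forest"),
           accuracy    = c(0.8095, 0.8268, 0.8315),
           sensitivity = c(0.7222, 0.8563, 0.8934),
           auc         = c(0.8504, 0.9058, 0.9254))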