The objective of this project is to build classifiers that predict whether a client will subscribe (yes/no) to a term deposit at a Portuguese bank, using data from May 2008 to November 2010. Financial institutions such as banks usually spend their time identifying customers who are likely to produce a positive result, i.e. those who would subscribe to the term deposit. However, with the shift in marketing strategies, research these days is more focused on turning negatives into positives. Hence, to make this project interesting and unconventional, we concentrated on correctly identifying the clients who do not subscribe to a term deposit. This would support efficient policy planning and help the bank survive against competitors.
The data set used is sourced from the UCI Machine Learning Repository [1]. In Phase I, we performed data preprocessing and exploration. In Phase II, we build three different machine learning classifiers on the prepared data. Overall, the report is organised as follows:
Section 1: Introduction
Section 2: Overview of the methodology
Section 3: Splitting the dataset
Section 4: Discussion of the classifiers, fine-tuning process and their detailed performance analysis
Section 5: Comparison of each classifier’s performance
Section 6: Critical Analysis of the Approach
Section 7: Conclusion
Section 8: References
Three classifiers are considered in this report: Decision Tree, Random Forest and Naive Bayes. Since we have a binary categorical target feature and a mix of categorical and numerical descriptive features, with categorical features making up the larger share (7 of 13), Decision Tree is chosen as the baseline classifier. Each classifier is trained to make probability predictions so that we can adjust the prediction threshold to refine performance. We split the full dataset into training data (70%) and test data (30%), each of which mirrors the full data with nearly the same proportion of target classes, i.e. approximately 88% of individuals not subscribing to the term deposit and 12% subscribing. Stratified sampling is used during cross-validation to account for the class imbalance of the target feature. The hyperparameters of each classifier are tuned using the mean misclassification error rate (mmce) as the performance measure. Next, for each classifier, we determine the optimal probability threshold. With the tuned hyperparameters and the optimal thresholds, we make predictions on the test data. In addition, paired t-tests and confusion matrices are used to evaluate each classifier's performance. The modelling is implemented in R with the mlr package (Bischl et al. 2016) [2], while the spFSR package is used for feature selection.
#Library
library(mlr)
library(tidyverse)
#Load the dataset
bank <- read.csv("bank_clean.csv")
# Use 70% of the rows for training
smp_size <- floor(0.7 * nrow(bank))
# Fix the seed so that the random split is reproducible
set.seed(123)
train_index <- sample(seq_len(nrow(bank)), size = smp_size)
#Assign to the train and test datasets
train <- bank[train_index, ]
test <- bank[-train_index, ]
# Check the proportion of target y in each dataset
prop.table(table(train$y))
##
## no yes
## 0.8828009 0.1171991
prop.table(table(test$y))
##
## no yes
## 0.8835152 0.1164848
After the split, the training and test data have nearly identical class proportions (about 88% “no” and 12% “yes”) and are therefore representative of the full dataset. Hence, it is reasonable to use the training data for modelling and the test data for model evaluation.
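The split above draws a simple random sample, which happens to preserve the class proportions here. If the proportions had to be guaranteed, a stratified split could be sketched as follows. This is a minimal base R illustration on synthetic data: the toy data frame and the stratified_split helper are hypothetical names and are not part of the original analysis.

```r
# Synthetic stand-in for the bank data (hypothetical): 88% "no", 12% "yes"
set.seed(123)
toy <- data.frame(
  x = rnorm(1000),
  y = factor(c(rep("no", 880), rep("yes", 120)))
)

# Sample the training rows separately within each class so that
# both parts keep the class mix of the full data
stratified_split <- function(y, train_frac = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = floor(train_frac * length(idx)))]
  }), use.names = FALSE)
  sort(train_idx)
}

train_idx <- stratified_split(toy$y)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions in each part
print(prop.table(table(toy_train$y)))
print(prop.table(table(toy_test$y)))
```

Sampling within each class keeps the share of “yes” cases in both parts equal to the share in the full data, up to rounding.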
For the hyperparameter finetuning process, a 5-fold cross validation resampling strategy is adopted.
# Configure classification task
classif.task <- makeClassifTask(data = train, target = 'y', id = 'bank')
#Get additional information
getTaskDesc(classif.task)
## $id
## [1] "bank"
##
## $type
## [1] "classif"
##
## $target
## [1] "y"
##
## $size
## [1] 31647
##
## $n.feat
## numerics factors ordered functionals
## 6 7 0 0
##
## $has.missings
## [1] FALSE
##
## $has.weights
## [1] FALSE
##
## $has.blocking
## [1] FALSE
##
## $has.coordinates
## [1] FALSE
##
## $class.levels
## [1] "no" "yes"
##
## $positive
## [1] "no"
##
## $negative
## [1] "yes"
##
## $class.distribution
##
## no yes
## 27938 3709
##
## attr(,"class")
## [1] "ClassifTaskDesc" "SupervisedTaskDesc" "TaskDesc"
getTaskTargetNames(classif.task)
## [1] "y"
getTaskType(classif.task)
## [1] "classif"
getTaskClassLevels(classif.task)
## [1] "no" "yes"
getTaskFeatureNames(classif.task)
## [1] "age" "job" "marital" "education" "default"
## [6] "balance" "housing" "loan" "duration" "campaign"
## [11] "pdays" "previous" "poutcome"
The object classif.task summarises the key information about our training dataset. The target feature is y, a binary response with 27938 “no” and 3709 “yes” observations; mlr treats “no” as the positive class. Besides the target feature, there are 13 descriptive features.
# Configure tune control search and a 5-CV stratified sampling
ctrl <- makeTuneControlGrid()
rdesc <- makeResampleDesc("CV", iters = 5L, stratify = TRUE)
Two hyperparameters are tuned for the Decision Tree: “minsplit” and “maxdepth”. According to Atkinson (2019) [3], “maxdepth” is the maximum depth of any node of the final tree, with the root node counted as depth 0, and “minsplit” is the minimum number of observations that must exist in a node for a split to be attempted. Guided by the ranges reported by getParamSet(), we search “maxdepth” and “minsplit” over the values from 1 to 28 in steps of 3, i.e. seq(1, 30, 3). The optimal result for the Decision Tree is maxdepth=28; minsplit=28 : mmce.test.mean=0.0989351.
From the plot, we can see that the mmce drops within the first 2-3 iterations and then stabilises.
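To make the size of this search concrete, the following base R sketch lists the candidate values and counts the combinations a full grid search has to evaluate. It only mirrors the values in the parameter set and does not call mlr; the variable names are illustrative.

```r
# Candidate values produced by seq(1, 30, 3); note that 30 itself is not reached
depth_values <- seq(1, 30, 3)
split_values <- seq(1, 30, 3)
print(depth_values)

# Every (maxdepth, minsplit) pair evaluated by an exhaustive grid search
grid <- expand.grid(maxdepth = depth_values, minsplit = split_values)
print(nrow(grid))

# With 5-fold cross-validation, each pair is fitted five times
n_fits <- nrow(grid) * 5
print(n_fits)
```

The reported optimum (maxdepth=28, minsplit=28) sits at the largest value of both sequences, so the best setting lies on the edge of the grid.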
# Configure learners with probability type
learner1 <- makeLearner('classif.rpart', predict.type = 'prob')
# Obtain parameters available for fine-tuning
getParamSet(learner1)
## Type len Def Constr Req Tunable Trafo
## minsplit integer - 20 1 to Inf - TRUE -
## minbucket integer - - 1 to Inf - TRUE -
## cp numeric - 0.01 0 to 1 - TRUE -
## maxcompete integer - 4 0 to Inf - TRUE -
## maxsurrogate integer - 5 0 to Inf - TRUE -
## usesurrogate discrete - 2 0,1,2 - TRUE -
## surrogatestyle discrete - 0 0,1 - TRUE -
## maxdepth integer - 30 1 to 30 - TRUE -
## xval integer - 10 0 to Inf - FALSE -
## parms untyped - - - - TRUE -
# Make Param Set
ps1 <- makeParamSet(
makeDiscreteParam('maxdepth', values = c(seq(1,30,3))),
makeDiscreteParam('minsplit', values = c(seq(1,30,3))))
# Configure tune Params settings
tunedLearner1_tuneparams <- tuneParams(learner = learner1,
task = classif.task,
resampling = rdesc,
par.set = ps1,
control = ctrl,
show.info =FALSE
)
# Getting the hyper parameter effects:
learner1_effect <- generateHyperParsEffectData(tunedLearner1_tuneparams)
#Plot the effect
plotHyperParsEffect(learner1_effect, x = "iteration", y = "mmce.test.mean", plot.type = "line") +
ggtitle("The Hyperparameter Effects of Decision Tree")
# Making the tuned model:
tunedLearner1 <- setHyperPars(learner1, par.vals = tunedLearner1_tuneparams$x)
# Train the tune wrappers
tunedMod1 <- train(tunedLearner1, classif.task)
# Predict on training data
tunedPred1 <- predict(tunedMod1, classif.task)
For the Random Forest, we fine-tune the number of features randomly sampled as candidates at each split (i.e. mtry) and the number of trees (ntree). For a classification problem, following Breiman (2001) [4], a common choice for mtry is the square root of p, where p is the number of descriptive features available in the dataset. In our case, the square root of 13 is about 3.6. Hence, we experimented with mtry = 2, 3, 4, 5, 6, which lies within the range given by getParamSet(). For the ntree argument, we used values from 10 to 100 in steps of 10. The result is mtry=3; ntree=100 : mmce.test.mean=0.0987772.
The plot shows that the mmce last drops at iteration 17 and stabilises thereafter.
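The arithmetic behind the mtry candidates can be checked with a few lines of base R. This is only a sketch of the rule of thumb and of the grid size; it assumes the usual floor(sqrt(p)) convention for classification forests and does not fit any model.

```r
p <- 13                           # number of descriptive features in the task
print(sqrt(p))                    # square root of p, about 3.61

# Rule-of-thumb value for classification: round the square root down
mtry_rule <- floor(sqrt(p))

# Candidate values bracket the rule-of-thumb value
mtry_values  <- 2:6
ntree_values <- seq(10, 100, 10)

# Number of (mtry, ntree) combinations in the grid
rf_grid <- expand.grid(mtry = mtry_values, ntree = ntree_values)
print(nrow(rf_grid))
```

The tuned value mtry=3 coincides with the rule-of-thumb value, while ntree=100 is the largest number of trees in the grid.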
# Configure learners with probability type
learner2 <- makeLearner('classif.randomForest', predict.type = 'prob')
# Obtain parameters available for fine-tuning
getParamSet(learner2)
## Type len Def Constr Req Tunable Trafo
## ntree integer - 500 1 to Inf - TRUE -
## mtry integer - - 1 to Inf - TRUE -
## replace logical - TRUE - - TRUE -
## classwt numericvector <NA> - 0 to Inf - TRUE -
## cutoff numericvector <NA> - 0 to 1 - TRUE -
## strata untyped - - - - FALSE -
## sampsize integervector <NA> - 1 to Inf - TRUE -
## nodesize integer - 1 1 to Inf - TRUE -
## maxnodes integer - - 1 to Inf - TRUE -
## importance logical - FALSE - - TRUE -
## localImp logical - FALSE - - TRUE -
## proximity logical - FALSE - - FALSE -
## oob.prox logical - - - Y FALSE -
## norm.votes logical - TRUE - - FALSE -
## do.trace logical - FALSE - - FALSE -
## keep.forest logical - TRUE - - FALSE -
## keep.inbag logical - FALSE - - FALSE -
# Make Param Set
ps2 <- makeParamSet(
makeDiscreteParam('mtry', values = c(2,3,4,5,6)),
makeDiscreteParam('ntree', values = c(seq(10,100,10))
))
# Configure tune Params settings
tunedLearner2_tuneparams <- tuneParams(learner = learner2,
task = classif.task,
resampling = rdesc,
par.set = ps2,
control = ctrl,
show.info = FALSE
)
# Getting the hyper parameter effects:
learner2_effect <- generateHyperParsEffectData(tunedLearner2_tuneparams)
#Plot the effect
plotHyperParsEffect(learner2_effect, x = "iteration", y = "mmce.test.mean", plot.type = "line") + ggtitle("The Hyperparameter Effects of Random Forest")
# Making the tuned model:
tunedLearner2 <- setHyperPars(learner2, par.vals = tunedLearner2_tuneparams$x)
# Train the tune wrappers
tunedMod2 <- train(tunedLearner2, classif.task)
# Predict on training data
tunedPred2 <- predict(tunedMod2, classif.task)
For Naive Bayes, we tuned the Laplace smoothing parameter (laplace), which is the only parameter reported by getParamSet() for this learner. We ran a grid search over the range 0 to 25. The optimal result was laplace=0 : mmce.test.mean=0.1163778.
The plot shows that the mmce drops to its lowest point at around the second iteration and then remains stable.
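To make the role of laplace concrete, the sketch below applies additive (Laplace) smoothing to the counts of one categorical feature within one class. The counts are invented for illustration, and the formula is the standard additive-smoothing estimate, which we assume matches what the underlying Naive Bayes implementation does for categorical features.

```r
# Conditional probability of each feature level given the class,
# with `laplace` pseudo-counts added to every level
smoothed_probs <- function(counts, laplace = 0) {
  (counts + laplace) / (sum(counts) + laplace * length(counts))
}

# Hypothetical counts of a three-level feature among one class
counts <- c(failure = 8, other = 2, success = 0)

# Without smoothing, the unseen level gets probability zero
p0 <- smoothed_probs(counts, laplace = 0)
print(p0)

# With laplace = 1, every level keeps a small non-zero probability
p1 <- smoothed_probs(counts, laplace = 1)
print(p1)
```

Smoothing matters mainly when some feature level never occurs together with a class in the training data. An optimum of laplace=0 suggests that this was not a limiting problem here.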
# Configure learners with probability type
learner4 <- makeLearner('classif.naiveBayes',predict.type = 'prob')
# Obtain parameters available for fine-tuning
getParamSet(learner4)
## Type len Def Constr Req Tunable Trafo
## laplace numeric - 0 0 to Inf - TRUE -
# Make Param Set
ps4 <-makeParamSet(makeNumericParam("laplace", lower = 0, upper = 25))
# Configure tune Params settings
tunedLearner4_tuneparams <- tuneParams(learner = learner4,
task = classif.task,
resampling = rdesc,
par.set = ps4,
control = ctrl,
show.info = FALSE)
# Getting the hyper parameter effects:
learner4_effect <- generateHyperParsEffectData(tunedLearner4_tuneparams)
#Plot the effect
plotHyperParsEffect(learner4_effect, x = "iteration", y = "mmce.test.mean", plot.type = "line") + ggtitle("The Hyperparameter Effects of Naive Bayes")
# Making the tuned model:
tunedLearner4 <- setHyperPars(learner4, par.vals = tunedLearner4_tuneparams$x)
# Train the tune wrappers
tunedMod4 <- train(tunedLearner4, classif.task)
# Predict on training data
tunedPred4 <- predict(tunedMod4, classif.task)
We also tried the feature selection available in the spFSR package on the Decision Tree to see whether performance improves with fewer features.
# Run SPSA on tuned Decision Tree learner
learner1_FS <- spFSR:: spFeatureSelection(classif.task, wrapper = tunedLearner1,
measure = mmce, num.features.selected = 0, show.info = FALSE)
# Get the best models from feature selection
spsaModel <- learner1_FS$best.model
#Plot the important variables
spFSR::plotImportance(learner1_FS)
With feature selection, the best model uses 10 descriptive features, took 35 iterations and achieved a best measure value (mmce) of 0.09862. The importance of the features is also plotted, which indicates that poutcome is the most important variable.
The following plots show the mmce across the range of probability thresholds, computed from the predictions on the training data. The thresholds that minimise the mmce are approximately 0.424, 0.646, and 0.162 for the Decision Tree, Random Forest, and Naive Bayes classifiers respectively. Each threshold is the cut-off applied to a classifier's predicted probability to turn it into a class label, i.e. whether or not an individual subscribes to the term deposit.
# Generate data on threshold vs. performance(s)
d1 <- generateThreshVsPerfData(tunedPred1, measures = list(mmce))
# Plot the threshold adjustment
plotThreshVsPerf(d1) + labs(title = 'Threshold Adjustment for Decision Tree', x = 'Threshold')
# Get threshold value
threshold1 <- d1$data$threshold[ which.min(d1$data$mmce) ]
# Generate data on threshold vs. performance(s)
d2 <- generateThreshVsPerfData(tunedPred2, measures = list(mmce))
# Plot the threshold adjustment
plotThreshVsPerf(d2) + labs(title = 'Threshold Adjustment for Random Forest', x = 'Threshold')
# Get threshold value
threshold2 <- d2$data$threshold[ which.min(d2$data$mmce) ]
# Generate data on threshold vs. performance(s)
d4 <- generateThreshVsPerfData(tunedPred4, measures = list(mmce))
# Plot the threshold adjustment
plotThreshVsPerf(d4) + labs(title = 'Threshold Adjustment for Naive Bayes', x = 'Threshold')
# Get threshold value
threshold4 <- d4$data$threshold[ which.min(d4$data$mmce) ]
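The threshold search performed above can be sketched in base R as follows. The data are synthetic, the names (truth, prob_no, mmce_at) are hypothetical, and the sketch assumes that an observation is labelled with the positive class (“no”) when its predicted probability exceeds the threshold.

```r
set.seed(1)
n <- 500

# Synthetic truth with roughly 88% "no" (the positive class in this report)
truth <- ifelse(runif(n) < 0.88, "no", "yes")

# Hypothetical predicted probabilities of "no": higher on average for true "no" cases
prob_no <- ifelse(truth == "no", rbeta(n, 5, 2), rbeta(n, 2, 3))

# Misclassification rate when "no" is predicted for probabilities above the threshold
mmce_at <- function(threshold) {
  pred <- ifelse(prob_no > threshold, "no", "yes")
  mean(pred != truth)
}

# Evaluate a grid of thresholds and keep the one with the lowest error
thresholds <- seq(0, 1, by = 0.01)
errors <- vapply(thresholds, mmce_at, numeric(1))
best_threshold <- thresholds[which.min(errors)]

print(best_threshold)
print(min(errors))
```

Because the threshold is selected on the same data that produced the predictions, the minimum error found this way is an optimistic estimate. This is one reason to confirm the result on held-out data.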
We use the tuned models and the optimal thresholds from the previous sections to make predictions on the test data.
The performance measures used to evaluate the model are:
AUC: Area Under The Curve. The higher, the better model.
mmce: Misclassification error rate. The lower, the better model.
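Both measures can be computed by hand on a small example. The sketch below uses six invented scores; auc_rank is a hypothetical helper that computes the AUC through the rank-sum (Mann-Whitney) formulation, which equals the share of positive-negative pairs in which the positive case receives the higher score.

```r
# AUC via the rank-sum formulation
auc_rank <- function(score, is_positive) {
  r <- rank(score)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Invented scores for the positive class and the true labels
score       <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# 8 of the 9 positive-negative pairs are ordered correctly
auc_value <- auc_rank(score, is_positive)
print(auc_value)

# mmce at a threshold of 0.5: share of observations with the wrong label
predicted_positive <- score > 0.5
mmce_value <- mean(predicted_positive != is_positive)
print(mmce_value)
```

AUC summarises the ranking over all thresholds, whereas mmce depends on the single threshold chosen, so the two measures can rank models differently.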
# Decision Tree
testPred1 <- predict(tunedMod1, newdata = test)
testPred1 <- setThreshold(testPred1, threshold1 )
# Random Forest
testPred2 <- predict(tunedMod2, newdata = test)
testPred2 <- setThreshold(testPred2, threshold2 )
# Naive Bayes
testPred4 <- predict(tunedMod4, newdata = test)
testPred4 <- setThreshold(testPred4, threshold4 )
# Decision Tree with Feature Selection
testPred1_FS <- predict(spsaModel, newdata = test)
#Comparing Decision Tree performance with Random Forest and Naive Bayes by using `plotROCCurves` plots
compare1 <- generateThreshVsPerfData(list(DecisionTree = testPred1, randomForest = testPred2, naivebayes = testPred4, DecisionTree_FS = testPred1_FS),
measures = list(fpr, tpr))
plotROCCurves(compare1)
The x-axis represents the false positive rate and the y-axis the true positive rate. The 45-degree dotted line is the uninformative baseline of the ROC plot: the closer a curve lies to this line, the less accurate the model, while a curve that bends further towards the upper left indicates a better fit.
In our context, the Decision Tree with Feature Selection clearly lies near the dotted line, which indicates a poor fit. That means it is no better than a random guess. On the other hand, the purple curve, which is Random Forest, is the farthest from the dotted line and closest to the top left.
However, the visualisation is for reference only; it should be confirmed with a statistical test.
We continue by evaluating the models on the test data. Since cross-validation is itself a random process, we performed pairwise t-tests to determine whether the difference in performance between any two classifiers is statistically significant. First, 5-fold stratified cross-validation (without any repetitions) is performed on the test data for each of the three learners, which the code below creates with makeLearner() and default hyperparameters. Second, a paired t-test is conducted on the AUC scores of the following model combinations:
Decision Tree and Random Forest
Decision Tree and Naïve Bayes
Random Forest and Naïve Bayes
The benchmark() function is used, which allows us to compare different learning algorithms across one or more tasks with a given resampling strategy.
# Configure classification task for test data
classif.task_test <- makeClassifTask(data = test, target = 'y', id = 'bank')
#Perform benchmark on each learner
bmr <- benchmark(learners = list(
makeLearner('classif.rpart', predict.type = 'prob'),
makeLearner('classif.randomForest', predict.type = 'prob'),
makeLearner('classif.naiveBayes',predict.type = 'prob')
), classif.task_test, rdesc, measures = auc)
## Task: bank, Learner: classif.rpart
## Resampling: cross-validation
## Measures: auc
## [Resample] iter 1: 0.7033568
## [Resample] iter 2: 0.7638780
## [Resample] iter 3: 0.7629817
## [Resample] iter 4: 0.7713499
## [Resample] iter 5: 0.7404331
##
## Aggregated Result: auc.test.mean=0.7483999
##
## Task: bank, Learner: classif.randomForest
## Resampling: cross-validation
## Measures: auc
## [Resample] iter 1: 0.8689369
## [Resample] iter 2: 0.8966342
## [Resample] iter 3: 0.8621068
## [Resample] iter 4: 0.8916928
## [Resample] iter 5: 0.8845333
##
## Aggregated Result: auc.test.mean=0.8807808
##
## Task: bank, Learner: classif.naiveBayes
## Resampling: cross-validation
## Measures: auc
## [Resample] iter 1: 0.8212388
## [Resample] iter 2: 0.8454690
## [Resample] iter 3: 0.8085265
## [Resample] iter 4: 0.8243308
## [Resample] iter 5: 0.8474385
##
## Aggregated Result: auc.test.mean=0.8294007
##
#Get the overall performance
performance <- getBMRPerformances(bmr, as.df = TRUE)
#Subset the data frame for Decision Tree
performance_rpart <- performance[c(1:5),]
#Subset the data frame for Random Forest
performance_rf <- performance[c(6:10),]
#Subset the data frame for Naive Bayes
performance_nb <- performance[c(11:15),]
# t-test for Decision Tree and Random Forest
t.test(performance_rpart$auc, performance_rf$auc, paired = TRUE, alternative = "two.sided")
##
## Paired t-test
##
## data: performance_rpart$auc and performance_rf$auc
## t = -11.863, df = 4, p-value = 0.0002891
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1633637 -0.1013981
## sample estimates:
## mean of the differences
## -0.1323809
# t-test for Decision Tree and Naive Bayes
t.test(performance_rpart$auc, performance_nb$auc, paired = TRUE, alternative = "two.sided")
##
## Paired t-test
##
## data: performance_rpart$auc and performance_nb$auc
## t = -5.6718, df = 4, p-value = 0.004767
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.12065185 -0.04134985
## sample estimates:
## mean of the differences
## -0.08100085
# t-test for Random Forest and Naive Bayes
t.test(performance_rf$auc, performance_nb$auc, paired = TRUE, alternative = "two.sided")
##
## Paired t-test
##
## data: performance_rf$auc and performance_nb$auc
## t = 10.511, df = 4, p-value = 0.0004633
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.03780758 0.06495256
## sample estimates:
## mean of the differences
## 0.05138007
The null hypothesis of each test is that both algorithms perform equally well on the dataset. With all p-values below the 5% level of significance, we reject the null hypothesis in every comparison, which indicates a difference in model performance. We conclude that, at the 5% significance level, Random Forest is the best model in this comparison in terms of AUC on the test data (mean AUC of 0.881, against 0.829 for Naive Bayes and 0.748 for Decision Tree).
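The first paired t-test can be reproduced by hand from the per-fold AUC values printed in the benchmark output. The sketch below uses only base R; the vector names are illustrative, and the values are copied from the Decision Tree and Random Forest folds reported above.

```r
# Per-fold AUC values copied from the benchmark output
auc_rpart <- c(0.7033568, 0.7638780, 0.7629817, 0.7713499, 0.7404331)
auc_rf    <- c(0.8689369, 0.8966342, 0.8621068, 0.8916928, 0.8845333)

# A paired test works on the fold-wise differences
d <- auc_rpart - auc_rf
n <- length(d)

# t statistic: mean difference divided by its standard error
t_stat <- mean(d) / (sd(d) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
print(c(mean_diff = mean(d), t = t_stat, df = n - 1, p = p_value))

# Cross-check against the built-in test
builtin <- t.test(auc_rpart, auc_rf, paired = TRUE, alternative = "two.sided")
print(builtin$statistic)
```

With only five folds the test has four degrees of freedom, so the conclusion rests on a small number of paired observations.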
# Calculate the confusion matrix for Decision Tree
calculateConfusionMatrix( testPred1,relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true no yes -err.-
## no 0.97/0.92 0.03/0.39 0.03
## yes 0.64/0.08 0.36/0.61 0.64
## -err.- 0.08 0.39 0.10
##
##
## Absolute confusion matrix:
## predicted
## true no yes -err.-
## no 11619 365 365
## yes 1008 572 1008
## -err.- 1008 365 1373
performance(testPred1, measures = list(f1, tpr, tnr, fpr, fnr, mmce))
## f1 tpr tnr fpr fnr mmce
## 0.94421194 0.96954272 0.36202532 0.63797468 0.03045728 0.10122383
# Calculate the confusion matrix for Random Forest
calculateConfusionMatrix( testPred2,relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true no yes -err.-
## no 0.93/0.94 0.07/0.48 0.07
## yes 0.42/0.06 0.58/0.52 0.42
## -err.- 0.06 0.48 0.11
##
##
## Absolute confusion matrix:
## predicted
## true no yes -err.-
## no 11135 849 849
## yes 669 911 669
## -err.- 669 849 1518
performance(testPred2, measures = list(f1, tpr, tnr, fpr, fnr, mmce))
## f1 tpr tnr fpr fnr mmce
## 0.93618631 0.92915554 0.57658228 0.42341772 0.07084446 0.11191389
# Calculate the confusion matrix for Naive Bayes
calculateConfusionMatrix( testPred4,relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true no yes -err.-
## no 0.96/0.92 0.04/0.46 0.04
## yes 0.66/0.08 0.34/0.54 0.66
## -err.- 0.08 0.46 0.11
##
##
## Absolute confusion matrix:
## predicted
## true no yes -err.-
## no 11529 455 455
## yes 1046 534 1046
## -err.- 1046 455 1501
performance(testPred4, measures = list(f1, tpr, tnr, fpr, fnr, mmce))
## f1 tpr tnr fpr fnr mmce
## 0.93888188 0.96203271 0.33797468 0.66202532 0.03796729 0.11066057
# Calculate the confusion matrix for Decision Tree with Feature Selection
calculateConfusionMatrix(testPred1_FS,relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true no yes -err.-
## no 3e-04/1.00 1e+00/0.88 1.00
## yes 0e+00/0.00 1e+00/0.12 0.00
## -err.- 0.00 0.88 0.88
##
##
## Absolute confusion matrix:
## predicted
## true no yes -err.-
## no 3 11981 11981
## yes 0 1580 0
## -err.- 0 11981 11981
performance( testPred1_FS )
## mmce
## 0.883294
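The -err.- column gives the number of errors for each true class and the -err.- row the number of errors for each predicted class. The three tuned classifiers trained on all features predict the clients who did not subscribe to a term deposit accurately (true positive rates between 0.93 and 0.97 for “no”), but they are much weaker on the subscribers. Decision Tree has the lowest mmce (0.101, against 0.112 for Random Forest and 0.111 for Naive Bayes), so it misclassifies the fewest test observations.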
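The reported measures follow directly from the absolute confusion matrix. As a worked check, the base R sketch below recomputes them for the Decision Tree, with “no” as the positive class, and adds the error rate of a trivial rule that always predicts “no”. The baseline is our own addition for context and is not part of the original analysis.

```r
# Absolute confusion matrix of the Decision Tree (rows: true, columns: predicted)
cm <- matrix(c(11619, 365,
               1008,  572),
             nrow = 2, byrow = TRUE,
             dimnames = list(true = c("no", "yes"), predicted = c("no", "yes")))

# "no" is the positive class
tp <- cm["no", "no"]
fn <- cm["no", "yes"]
fp <- cm["yes", "no"]
tn <- cm["yes", "yes"]

tpr       <- tp / (tp + fn)               # recall for "no"
tnr       <- tn / (tn + fp)               # recall for "yes"
precision <- tp / (tp + fp)               # precision for "no"
f1        <- 2 * tp / (2 * tp + fp + fn)
mmce      <- (fp + fn) / sum(cm)

print(round(c(f1 = f1, tpr = tpr, tnr = tnr, precision = precision, mmce = mmce), 4))

# Error rate of always predicting "no": the share of true "yes" cases
baseline_mmce <- sum(cm["yes", ]) / sum(cm)
print(round(baseline_mmce, 4))
```

The always-“no” rule has an error rate of about 0.116, so the Decision Tree's mmce of about 0.101 is only a modest improvement on it. This is why the class-wise rates are worth reading alongside the mmce.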
The decision tree splits on an information gain criterion, choosing splits that increase the homogeneity (purity) of the resulting nodes, so it is generally believed to work well on data with categorical descriptive features. With this assumption in mind and a high proportion of categorical descriptive features in our dataset, we considered the Decision Tree as the baseline model. In the end, the assumption is supported, as the Decision Tree gave the highest recall for the class of interest (“no”) and the lowest mean misclassification error rate on the test data. The main strength of this project is its systematic use of hyperparameter tuning and feature selection. Besides cross-validation, we also used paired t-tests to check whether the differences between the models are statistically significant. Based on the nature of our dataset, we applied three popular methods known to work well with categorical descriptive features. On the other hand, some things could have been improved, such as more research into the domain of the dataset. Subject matter expertise would enhance our ability to interpret, analyse and explore the features and to draw conclusions about the relationships amongst them. This can be addressed in future work to gain a more practical understanding of the modelling approach. The performance of the models could also be compared and validated at other splits, such as 80:20 or 60:40, which would provide further support for the selected model.
The AUC and the confusion-matrix measures lead to different conclusions. While AUC identifies Random Forest as the best model based on the area under the ROC curve, the confusion matrices favour the Decision Tree, which has the best recall for the “no” class, F1 and mmce. We also notice that the Decision Tree is sensitive to the number of features selected for modelling: reducing the number of descriptive features to 10 sharply worsens performance on the test data (mmce of 0.883). For this reason, working with the full feature set is preferable to the selected features for this dataset. In the end, the Decision Tree outperforms Random Forest and Naive Bayes on the confusion-matrix measures, and we decided to settle on it as the final and best model.
[1]. S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.
[2]. Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research.
[3]. Atkinson, B. 2019. “Rpart.” Machine Learning.
[4]. Breiman, L. 2001. “Random Forests.” Machine Learning.