This report deals with classification modelling of bank telemarketing data. The data is obtained from https://archive.ics.uci.edu/ml/datasets/bank+marketing.
We first invoke the required libraries.
library(tidyverse)
library(caret)
library(plotly)
library(ggplot2)
library(data.table)
library(GGally)
library(tidymodels)
library(scales)
library(lmtest)
library(inspectdf)
library(randomForest)
library(e1071)
# library(ggcorrplot)
options(scipen = 100, max.print = 1e+06)
We then download the data and read it into a data frame named “bankmart”.
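A minimal sketch of one way to do this is shown below; the archive and file names are assumptions based on the UCI repository layout (the 41188-row version is distributed as ‘bank-additional-full.csv’, semicolon-separated, inside ‘bank-additional.zip’).
# Assumed archive layout; adjust the paths if the repository changes
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip"
download.file(url, destfile = "bank-additional.zip")
unzip("bank-additional.zip")
bankmart <- read.csv("bank-additional/bank-additional-full.csv",
                     sep = ";", stringsAsFactors = FALSE)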
colnames(bankmart)
## [1] "age" "job" "marital" "education"
## [5] "default" "housing" "loan" "contact"
## [9] "month" "day_of_week" "duration" "campaign"
## [13] "pdays" "previous" "poutcome" "emp.var.rate"
## [17] "cons.price.idx" "cons.conf.idx" "euribor3m" "nr.employed"
## [21] "y"
str(bankmart)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
summary(bankmart)
## age job marital education
## Min. :17.00 Length:41188 Length:41188 Length:41188
## 1st Qu.:32.00 Class :character Class :character Class :character
## Median :38.00 Mode :character Mode :character Mode :character
## Mean :40.02
## 3rd Qu.:47.00
## Max. :98.00
## default housing loan contact
## Length:41188 Length:41188 Length:41188 Length:41188
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## month day_of_week duration campaign
## Length:41188 Length:41188 Min. : 0.0 Min. : 1.000
## Class :character Class :character 1st Qu.: 102.0 1st Qu.: 1.000
## Mode :character Mode :character Median : 180.0 Median : 2.000
## Mean : 258.3 Mean : 2.568
## 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :4918.0 Max. :56.000
## pdays previous poutcome emp.var.rate
## Min. : 0.0 Min. :0.000 Length:41188 Min. :-3.40000
## 1st Qu.:999.0 1st Qu.:0.000 Class :character 1st Qu.:-1.80000
## Median :999.0 Median :0.000 Mode :character Median : 1.10000
## Mean :962.5 Mean :0.173 Mean : 0.08189
## 3rd Qu.:999.0 3rd Qu.:0.000 3rd Qu.: 1.40000
## Max. :999.0 Max. :7.000 Max. : 1.40000
## cons.price.idx cons.conf.idx euribor3m nr.employed
## Min. :92.20 Min. :-50.8 Min. :0.634 Min. :4964
## 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344 1st Qu.:5099
## Median :93.75 Median :-41.8 Median :4.857 Median :5191
## Mean :93.58 Mean :-40.5 Mean :3.621 Mean :5167
## 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961 3rd Qu.:5228
## Max. :94.77 Max. :-26.9 Max. :5.045 Max. :5228
## y
## Length:41188
## Class :character
## Mode :character
##
##
##
We take a preliminary look at the correlations between the numeric variables.
ggcorr(bankmart, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(bankmart, label = TRUE, label_size = 2.9, hjust = 1,
## layout.exp = 2): data in column(s) 'job', 'marital', 'education', 'default',
## 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'y' are not
## numeric and were ignored
We check the unique values of ‘pdays’.
unique(bankmart$pdays)
## [1] 999 6 4 3 5 1 0 10 7 8 9 11 2 12 13 14 15 16 21
## [20] 17 18 22 25 26 19 27 20
Below is a brief description of the features of this dataset.
The output (target) variable is in column ‘y’: binary (‘yes’, ‘no’) data on whether the client has subscribed to a term deposit.
The column ‘campaign’ shows the number of contacts performed during this campaign for this client (numeric, includes the last contact).
The column ‘pdays’ shows the number of days that passed after the client was last contacted in a previous campaign (numeric). The dataset documentation states that 999 means the client was not previously contacted (the older ‘bank-full’ version of this dataset uses -1 for the same purpose).
The column ‘previous’ shows the number of contacts performed before this campaign for this client (numeric).
The column ‘poutcome’ shows the outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’).
The column ‘contact’ shows the contact communication type (categorical: ‘cellular’, ‘telephone’).
The column ‘month’ shows the last contact month of the year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’).
The column ‘day_of_week’ shows the last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ‘fri’).
The column ‘duration’ shows the last contact duration, in seconds (numeric).
The column ‘age’ shows the client’s age (numeric).
The column ‘job’ shows the type of job (categorical).
The column ‘marital’ shows the marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’).
The column ‘education’ shows the education level (categorical).
The column ‘default’ shows whether the person has ever defaulted on a credit facility (categorical: ‘no’, ‘yes’, ‘unknown’).
The column ‘housing’ shows whether the person has a housing loan (categorical: ‘no’, ‘yes’, ‘unknown’).
The column ‘loan’ shows whether the person has a personal loan (categorical: ‘no’, ‘yes’, ‘unknown’).
The columns ‘emp.var.rate’, ‘cons.price.idx’, ‘cons.conf.idx’, ‘euribor3m’, and ‘nr.employed’ are macroeconomic indicators: the employment variation rate, the consumer price index, the consumer confidence index, the euribor 3-month rate, and the number of employees (all numeric).
We check for missing values.
colSums(is.na(bankmart))
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y
## 0
The data types of the dataset require changes: the “chr” columns must be converted to “factor”.
bankmart$job <- as.factor(bankmart$job)
bankmart$marital <- as.factor(bankmart$marital)
bankmart$education<- as.factor(bankmart$education)
bankmart$default <- as.factor(bankmart$default)
bankmart$housing <- as.factor(bankmart$housing)
bankmart$loan <- as.factor(bankmart$loan)
bankmart$contact<- as.factor(bankmart$contact)
bankmart$month<- as.factor(bankmart$month)
bankmart$day_of_week <- as.factor(bankmart$day_of_week)
bankmart$poutcome <- as.factor(bankmart$poutcome)
bankmart$y <- as.factor(bankmart$y)
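Equivalently, since the tidyverse is already loaded, all character columns can be converted in one step; this is a compact alternative to the per-column calls above with the same effect.
# Convert every character column to a factor in one pass
bankmart <- bankmart %>%
  mutate(across(where(is.character), as.factor))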
We change any 999 value found in ‘pdays’ to 0.
bankmart$pdays[bankmart$pdays == 999] <- 0
unique(bankmart$pdays)
## [1] 0 6 4 3 5 1 10 7 8 9 11 2 12 13 14 15 16 21 17 18 22 25 26 19 27
## [26] 20
If the ‘duration’ value is 0, the outcome will almost certainly be ‘no’, so such rows tie this predictor deterministically to the target and would leak the outcome into the model. We will delete the rows where ‘duration’ equals 0 (four rows here: the row count drops from 41188 to 41184).
bankmart <- bankmart[bankmart$duration != 0, ]
We also rename column “y” to “Sale”:
colnames(bankmart)[21] <- "Sale"
We check that changes made have taken effect.
summary(bankmart)
## age job marital
## Min. :17.00 admin. :10421 divorced: 4611
## 1st Qu.:32.00 blue-collar: 9252 married :24925
## Median :38.00 technician : 6743 single :11568
## Mean :40.02 services : 3969 unknown : 80
## 3rd Qu.:47.00 management : 2923
## Max. :98.00 retired : 1720
## (Other) : 6156
## education default housing loan
## university.degree :12167 no :32584 no :18621 no :33946
## high.school : 9513 unknown: 8597 unknown: 990 unknown: 990
## basic.9y : 6044 yes : 3 yes :21573 yes : 6248
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## contact month day_of_week duration
## cellular :26141 may :13767 fri:7826 Min. : 1.0
## telephone:15043 jul : 7174 mon:8513 1st Qu.: 102.0
## aug : 6177 thu:8623 Median : 180.0
## jun : 5318 tue:8088 Mean : 258.3
## nov : 4101 wed:8134 3rd Qu.: 319.0
## apr : 2631 Max. :4918.0
## (Other): 2016
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 0.0000 Min. :0.000 failure : 4252
## 1st Qu.: 1.000 1st Qu.: 0.0000 1st Qu.:0.000 nonexistent:35559
## Median : 2.000 Median : 0.0000 Median :0.000 success : 1373
## Mean : 2.567 Mean : 0.2213 Mean :0.173
## 3rd Qu.: 3.000 3rd Qu.: 0.0000 3rd Qu.:0.000
## Max. :56.000 Max. :27.0000 Max. :7.000
##
## emp.var.rate cons.price.idx cons.conf.idx euribor3m
## Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.634
## 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344
## Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
## Mean : 0.08192 Mean :93.58 Mean :-40.5 Mean :3.621
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961
## Max. : 1.40000 Max. :94.77 Max. :-26.9 Max. :5.045
##
## nr.employed Sale
## Min. :4964 no :36544
## 1st Qu.:5099 yes: 4640
## Median :5191
## Mean :5167
## 3rd Qu.:5228
## Max. :5228
##
str(bankmart)
## 'data.frame': 41184 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : num 0 0 0 0 0 0 0 0 0 0 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ Sale : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
We check for correlation.
ggcorr(bankmart, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(bankmart, label = TRUE, label_size = 2.9, hjust = 1,
## layout.exp = 2): data in column(s) 'job', 'marital', 'education', 'default',
## 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'Sale' are not
## numeric and were ignored
The target variable does not appear in the plot because it is a factor. Based on the plot, some predictor variables are highly correlated with one another, especially the macroeconomic indicators: ‘nr.employed’, ‘euribor3m’, and ‘cons.price.idx’. This is an early warning that the data may not be appropriate for a model that requires the predictors to be independent of each other.
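The visual impression can be confirmed numerically; the snippet below (output not shown) computes the pairwise correlations among the flagged macroeconomic indicators.
# Pairwise correlations among the macroeconomic indicators
round(cor(bankmart[, c("emp.var.rate", "cons.price.idx",
                       "euribor3m", "nr.employed")]), 2)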
We make the following quick observations:
Mean age of respondents is 40.
A large population of the respondents work in administrative, blue-collar, and technician jobs.
More than half of the respondents are married.
More than half of the respondents have a university or high school education and 5243 of them have a professional qualification.
Calls to the respondents were made on weekdays only, in roughly equal daily volumes.
The mean number of calls made per respondent was about 2.6 and the maximum was 56.
The mean duration of calls was 258 seconds.
Most respondents had not been contacted before this campaign (the median of ‘previous’ is 0 and its maximum is 7).
The remaining data concern macroeconomic indicators over the period of data collection. The means of these indicators are a euribor 3-month rate of 3.621, a consumer price index of 93.58, a consumer confidence index of -40.5, and an employment variation rate of 0.082.
More than 75% of respondents have never defaulted on a credit facility, the default status of about 21% is unknown, and only 3 respondents have defaulted.
More than half of the respondents already have a housing loan (21573) while a little less than half (18621) have none.
More than 75% of the respondents (33946) have no personal loan, 6248 already have one, and the personal loan status of 990 is unknown.
About a third of the calls (13767) were made in the month of May.
We split the dataset into 80% “train” and 20% “test” sets.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
samplesize <- round(0.8 * nrow(bankmart), 0)
index <- sample(seq_len(nrow(bankmart)), size = samplesize)
bankmart_train <- bankmart[index, ]
bankmart_test <- bankmart[-index, ]
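As an aside, caret’s createDataPartition() draws a stratified sample that preserves the class proportions in both sets; a minimal sketch of that alternative (not used further in this report) follows.
# Stratified 80/20 split on the target variable
set.seed(100)
idx <- createDataPartition(bankmart$Sale, p = 0.8, list = FALSE)
bankmart_train_strat <- bankmart[idx, ]
bankmart_test_strat <- bankmart[-idx, ]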
prop.table(table(bankmart$Sale))
##
## no yes
## 0.8873349 0.1126651
barplot(prop.table(table(bankmart$Sale)),
col = rainbow(2),
ylim = c(0, 1.0),
main = "Class Distribution")
Based on the plot it is clearly evident that about 89% of the data fall in one class and the remaining 11% in the other.
table(bankmart$Sale)
##
## no yes
## 36544 4640
If we were to build a model on this dataset as-is, the accuracy of predicting respondents who do not subscribe to the term deposit product would be much higher than for respondents who do subscribe.
Class “yes” is much smaller, so within the “train” set we keep only as many class “no” observations as there are “yes” observations. We may use the downSample() function to get a balanced distribution for the “train” set.
library(caret)
set.seed(1234)
bankmart_train_down <- downSample(x = bankmart_train %>% select(-Sale),
                                  y = bankmart_train$Sale,
                                  yname = "Sale")
table(bankmart_train_down$Sale)
##
## no yes
## 3758 3758
Now the size of each class is 3758.
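caret also provides upSample(), which balances the classes the other way by resampling the minority class with replacement up to the majority class size; a sketch (not used further here) follows.
# Up-sampling alternative: grow class "yes" to match class "no"
set.seed(1234)
bankmart_train_up <- upSample(x = bankmart_train %>% select(-Sale),
                              y = bankmart_train$Sale,
                              yname = "Sale")
table(bankmart_train_up$Sale)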
The Naive Bayes model
We build the model using all predictors.
library(e1071)
model_naive_1 <- naiveBayes(Sale ~ ., data = bankmart_train_down, laplace = 1)
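The fitted object stores the class priors and the per-predictor conditional probability tables, which can be inspected directly (output not shown):
model_naive_1$apriori # class counts used for the prior probabilities
model_naive_1$tables$poutcome # conditional distribution of 'poutcome' given 'Sale'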
We store the predictions made from the predictors in the “bankmart_test” set.
pred_naive_1<- predict(model_naive_1, newdata = bankmart_test[,-1])
## Warning in predict.naiveBayes(model_naive_1, newdata = bankmart_test[, -1]):
## Type mismatch between training and new data for variable 'age'. Did you use
## factors with numeric labels for training, and numeric values for new data?
We build the model using personal data and credit history and status data: ‘age’, ‘job’, ‘marital’, ‘education’, ‘default’, ‘housing’, and ‘loan’ predictors.
model_naive_2 <- naiveBayes(Sale ~ age + job + marital + education + default + housing +loan, data = bankmart_train_down, laplace = 1)
pred_naive_2<- predict(model_naive_2, newdata = bankmart_test[,-1])
## Warning in predict.naiveBayes(model_naive_2, newdata = bankmart_test[, -1]):
## Type mismatch between training and new data for variable 'age'. Did you use
## factors with numeric labels for training, and numeric values for new data?
We build the model using descriptors of communication with the respondent: ‘duration’, ‘campaign’, ‘pdays’, ‘previous’, and ‘poutcome’ predictors.
model_naive_3 <- naiveBayes(Sale ~ duration + campaign + pdays + previous + poutcome, data = bankmart_train_down, laplace = 1)
pred_naive_3<- predict(model_naive_3, newdata = bankmart_test[,-1])
We build the model using descriptors of macroeconomic indicators: ‘emp.var.rate’, ‘cons.price.idx’, ‘cons.conf.idx’, ‘euribor3m’, and ‘nr.employed’ predictors.
model_naive_4 <- naiveBayes(Sale ~ emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m + nr.employed , data = bankmart_train_down, laplace = 1)
pred_naive_4 <- predict(model_naive_4, newdata = bankmart_test[,-1])
We display the results of predictions.
(conf_matrix_naive <- table(pred_naive_1, bankmart_test$Sale))
##
## pred_naive_1 no yes
## no 5949 223
## yes 1406 659
confusionMatrix(pred_naive_1, reference = bankmart_test$Sale, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 5949 223
## yes 1406 659
##
## Accuracy : 0.8022
## 95% CI : (0.7935, 0.8108)
## No Information Rate : 0.8929
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3496
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.7472
## Specificity : 0.8088
## Pos Pred Value : 0.3191
## Neg Pred Value : 0.9639
## Prevalence : 0.1071
## Detection Rate : 0.0800
## Detection Prevalence : 0.2507
## Balanced Accuracy : 0.7780
##
## 'Positive' Class : yes
##
(conf_matrix_naive <- table(pred_naive_2, bankmart_test$Sale))
##
## pred_naive_2 no yes
## no 3878 298
## yes 3477 584
confusionMatrix(pred_naive_2, reference = bankmart_test$Sale, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 3878 298
## yes 3477 584
##
## Accuracy : 0.5417
## 95% CI : (0.5309, 0.5525)
## No Information Rate : 0.8929
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0732
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.6621
## Specificity : 0.5273
## Pos Pred Value : 0.1438
## Neg Pred Value : 0.9286
## Prevalence : 0.1071
## Detection Rate : 0.0709
## Detection Prevalence : 0.4930
## Balanced Accuracy : 0.5947
##
## 'Positive' Class : yes
##
(conf_matrix_naive <- table(pred_naive_3, bankmart_test$Sale))
##
## pred_naive_3 no yes
## no 6874 445
## yes 481 437
confusionMatrix(pred_naive_3, reference = bankmart_test$Sale, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6874 445
## yes 481 437
##
## Accuracy : 0.8876
## 95% CI : (0.8806, 0.8943)
## No Information Rate : 0.8929
## P-Value [Acc > NIR] : 0.9428
##
## Kappa : 0.4225
##
## Mcnemar's Test P-Value : 0.2501
##
## Sensitivity : 0.49546
## Specificity : 0.93460
## Pos Pred Value : 0.47603
## Neg Pred Value : 0.93920
## Prevalence : 0.10708
## Detection Rate : 0.05305
## Detection Prevalence : 0.11145
## Balanced Accuracy : 0.71503
##
## 'Positive' Class : yes
##
(conf_matrix_naive <- table(pred_naive_4, bankmart_test$Sale))
##
## pred_naive_4 no yes
## no 5286 260
## yes 2069 622
confusionMatrix(pred_naive_4, reference = bankmart_test$Sale, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 5286 260
## yes 2069 622
##
## Accuracy : 0.7173
## 95% CI : (0.7074, 0.727)
## No Information Rate : 0.8929
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2228
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.70522
## Specificity : 0.71869
## Pos Pred Value : 0.23114
## Neg Pred Value : 0.95312
## Prevalence : 0.10708
## Detection Rate : 0.07551
## Detection Prevalence : 0.32670
## Balanced Accuracy : 0.71196
##
## 'Positive' Class : yes
##
Our best Naive Bayes model is “model_naive_3”, which gives a prediction accuracy of 89% with a 95% confidence interval for accuracy between 88% and 89%. “model_naive_3” uses the predictors that describe contact or interaction with respondents. Collinearity concerns are also reduced in this model because the predictors that are highly correlated with each other have been excluded.
Decision Tree modelling is a predictive modelling method in which the outcome (dependent) variable is categorical (here binary) and the predictor (independent) variables can be continuous or categorical. Decision trees can be used for both regression and classification. A decision tree makes no distributional assumptions about the predictors and does not assume linear relationships between the predictors and the outcome.
The main benefits of decision trees are that they require minimal data preparation, do not require feature scaling, can handle missing values automatically, and have short training times.
The main decision tree issues are overfitting (performing well on training data but not on unseen data), the growth of new nodes to fit a small number of ‘noisy’ observations, which can make the tree highly complex, and poor suitability for very large datasets.
knitr::include_graphics('DT_Example.jpg')
To begin building this model we confirm that the relevant “character” type predictors and the target are in “factor” form.
str(bankmart_train_down)
## 'data.frame': 7516 obs. of 21 variables:
## $ age : int 27 30 58 53 45 37 39 50 53 36 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 6 11 2 10 10 3 6 10 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 3 2 2 2 2 3 3 2 1 2 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 2 4 3 3 3 4 6 3 6 7 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 2 2 2 1 1 1 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 3 3 1 2 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 1 2 1 2 2 1 1 1 1 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 4 7 5 7 7 5 7 4 2 2 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 4 4 2 4 3 4 1 3 2 1 ...
## $ duration : int 68 414 198 205 525 548 264 129 119 248 ...
## $ campaign : int 3 1 1 2 1 3 1 1 3 1 ...
## $ pdays : num 0 0 0 0 0 0 0 0 0 0 ...
## $ previous : int 0 0 0 1 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 1 2 2 2 2 2 2 ...
## $ emp.var.rate : num 1.4 -1.8 1.4 -1.8 1.1 1.4 -1.8 1.4 1.4 1.4 ...
## $ cons.price.idx: num 93.9 92.9 94.5 92.9 94 ...
## $ cons.conf.idx : num -42.7 -46.2 -41.8 -46.2 -36.4 -41.8 -46.2 -42.7 -36.1 -36.1 ...
## $ euribor3m : num 4.96 1.34 4.87 1.29 4.86 ...
## $ nr.employed : num 5228 5099 5228 5099 5191 ...
## $ Sale : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
set.seed(123)
bankmart_train_dt <- ctree(formula = Sale ~ euribor3m + housing + loan + duration + campaign + pdays + previous + poutcome, data = bankmart_train_down)
plot(bankmart_train_dt, type = "simple")
pred_dt <- predict(bankmart_train_dt, newdata = bankmart_test, type = "response")
confusionMatrix(pred_dt, reference = bankmart_test$Sale, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6075 69
## yes 1280 813
##
## Accuracy : 0.8362
## 95% CI : (0.8281, 0.8442)
## No Information Rate : 0.8929
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4661
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.9218
## Specificity : 0.8260
## Pos Pred Value : 0.3884
## Neg Pred Value : 0.9888
## Prevalence : 0.1071
## Detection Rate : 0.0987
## Detection Prevalence : 0.2541
## Balanced Accuracy : 0.8739
##
## 'Positive' Class : yes
##
Our Decision Tree model achieved an accuracy of 83% with a 95% confidence interval between 82% and 84%, and a sensitivity of 92%. The Decision Tree model used the predictors “euribor3m”, “housing”, “loan”, “duration”, “campaign”, “pdays”, “previous”, and “poutcome”.
Random forest chooses random subsets of features and builds many decision trees; the model aggregates the predictions of all the trees. It performs “bagging” (bootstrap aggregation): subsets of the training data are created through random sampling with replacement, and a separate decision tree is trained on each subset. The final prediction is then generated by a voting mechanism (for classification) or by averaging the outputs (for regression) of the individual trees. This overcomes the overfitting problem of a single decision tree.
library(randomForest)
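For intuition, a bare-bones forest can also be grown with randomForest() directly; the parameters below are illustrative only, not the tuned values used in this report.
# Minimal random forest sketch on the downsampled training set
set.seed(1234)
rf_sketch <- randomForest(Sale ~ ., data = bankmart_train_down,
                          ntree = 100, mtry = 4)
rf_sketch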
We use mtry = 10 (the number of variables tried at each split).
# Define the control
trControl <- trainControl(method = "cv",
number = 10,
search = "grid")
tuneGrid <- expand.grid(.mtry = c(10))
We set ntree = 300 with “nodesize” and “maxnodes” at their default values, using the “euribor3m”, “age”, “education”, “job”, “default”, “housing”, “loan”, “previous”, “pdays”, “campaign”, and “duration” predictors. The following code has been commented out to save considerable time when knitting the document.
# set.seed(1234)
#
# fit_rf <- train(Sale~euribor3m + age + education + job + default + housing + loan + previous + pdays + campaign + duration,
# bankmart_train,
# method = "rf",
# metric = "Accuracy",
# tuneGrid = tuneGrid,
# trControl = trControl,
# importance = TRUE,
# # nodesize = 10,
# ntree = 300,
# # maxnodes = 10
# )
We save the newly built model so that it can be reloaded later without retraining.
# saving the model
# saveRDS(fit_rf, file = "fit_rf.rds")
fit_rf<-readRDS("fit_rf.rds")
Next we use the model to predict the target from the “test” set predictors.
prediction <- predict(fit_rf, bankmart_test)
We then inspect the confusion matrix indicators of the predictions.
confusionMatrix(prediction, bankmart_test$Sale, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6841 193
## yes 514 689
##
## Accuracy : 0.9142
## 95% CI : (0.9079, 0.9201)
## No Information Rate : 0.8929
## P-Value [Acc > NIR] : 0.00000000007224
##
## Kappa : 0.6131
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.78118
## Specificity : 0.93012
## Pos Pred Value : 0.57273
## Neg Pred Value : 0.97256
## Prevalence : 0.10708
## Detection Rate : 0.08365
## Detection Prevalence : 0.14605
## Balanced Accuracy : 0.85565
##
## 'Positive' Class : yes
##
With this random forest model we have achieved an accuracy of 91% with a 95% confidence interval between 90% and 92%. The sensitivity, however, is low at 78%.
We always prefer that both Sensitivity (Recall), TP/(TP+FN), and Specificity, TN/(TN+FP), approach 1.0, since together these two rates determine how well our model makes “yes” and “no” predictions.
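These two rates can be computed by hand from the random forest predictions above; the snippet below is illustrative, since confusionMatrix() already reports the same values.
cm <- table(Predicted = prediction, Actual = bankmart_test$Sale)
TP <- cm["yes", "yes"]; FN <- cm["no", "yes"] # true positives, false negatives
TN <- cm["no", "no"]; FP <- cm["yes", "no"] # true negatives, false positives
TP / (TP + FN) # sensitivity
TN / (TN + FP) # specificity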
We can check the parameters of the previous random forest model building process as follows.
fit_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 300, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 10
##
## OOB estimate of error rate: 8.81%
## Confusion matrix:
## no yes class.error
## no 28183 1083 0.0370054
## yes 1821 1863 0.4942997
We can see above that our OOB measure registers an error rate of 8.81%.
plot(fit_rf$finalModel)
legend("topright", colnames(fit_rf$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)
The plot above shows that the error of predicting a “yes” for target
“Sale” stagnates at 0.5 for trees > 10.
plot(varImp(fit_rf))
The above plot shows that ‘duration’ is the most important predictor variable followed by ‘euribor3m’.
We now proceed to plot the ROC curve and compute the AUC. ROC is a curve that represents the relationship between the Sensitivity (true positive rate, TPR) and the False Positive Rate (FPR) at each classification threshold. Ideally a good model has a high true positive rate and a low false positive rate.
AUC is the area under an ROC curve. A high AUC signifies a model that has a high TPR/Recall and a low FPR. An AUC close to 1 means that the model can distinguish (‘yes’ and ‘no’) classes very well, while an AUC of 0.5 means that the model is no better at predicting classes than by a random selection process.
prob_test <- predict(fit_rf, newdata = bankmart_test, type = "prob")
library(ROCR)
pred_roc <- prediction(prob_test[, "yes"], bankmart_test$Sale)
perf <- performance(pred_roc, "tpr", "fpr")
perf
## A performance instance
## 'False positive rate' vs. 'True positive rate' (alpha: 'Cutoff')
## with 295 data points
plot(perf)
auc <-performance(pred_roc, measure = "auc")
print(auc@y.values)
## [[1]]
## [1] 0.9544342
The best Naive Bayes model (“model_naive_3”) used data describing contact with the clients; it yielded an accuracy of 89% with a sensitivity of 49% and a specificity of 93%.
The decision tree model yielded an accuracy of 83% with a sensitivity of 92% and a specificity of 82%.
The random forest model yielded an accuracy of 91% with a sensitivity of 78% and a specificity of 93%. The random forest model measures the error for classifying a ‘no’ as 3.7% and the error for classifying a ‘yes’ as 49%. The OOB error estimate from the random forest model is 8.8%, with an AUC of 0.95.
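For convenience, the headline metrics of the three models can be gathered into a single table (values transcribed from the confusion matrix outputs above).
data.frame(model = c("Naive Bayes (model_naive_3)", "Decision Tree", "Random Forest"),
           accuracy = c(0.888, 0.836, 0.914),
           sensitivity = c(0.495, 0.922, 0.781),
           specificity = c(0.935, 0.826, 0.930))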