Introduction

This report deals with classification modelling of bank telemarketing data. The data is obtained from https://archive.ics.uci.edu/ml/datasets/bank+marketing.

Data : Bank Telemarketing

Data Inspection

We first invoke the required libraries.

library(tidyverse)
library(caret)
library(plotly)
library(ggplot2)
library(data.table)
library(GGally)
library(tidymodels)
library(scales)
library(lmtest)
library(inspectdf) 
library(randomForest)
library(e1071)
# library(ggcorrplot)

options(scipen = 100, max.print = 1e+06)

We then download the data and read it into a data frame named "bankmart".
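A minimal sketch of one way to fetch and read the file follows; the URL and file path are assumptions based on the UCI repository layout for this dataset, which ships the "bank-additional-full" file as a semicolon-separated CSV.

# assumed URL/path for the UCI "bank-additional" archive
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip"
download.file(url, destfile = "bank-additional.zip")
unzip("bank-additional.zip")
# the full file uses ';' as the field separator
bankmart <- read.csv("bank-additional/bank-additional-full.csv",
                     sep = ";", stringsAsFactors = FALSE)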

colnames(bankmart)
##  [1] "age"            "job"            "marital"        "education"     
##  [5] "default"        "housing"        "loan"           "contact"       
##  [9] "month"          "day_of_week"    "duration"       "campaign"      
## [13] "pdays"          "previous"       "poutcome"       "emp.var.rate"  
## [17] "cons.price.idx" "cons.conf.idx"  "euribor3m"      "nr.employed"   
## [21] "y"
str(bankmart)
## 'data.frame':    41188 obs. of  21 variables:
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : chr  "housemaid" "services" "services" "admin." ...
##  $ marital       : chr  "married" "married" "married" "married" ...
##  $ education     : chr  "basic.4y" "high.school" "high.school" "basic.6y" ...
##  $ default       : chr  "no" "unknown" "no" "no" ...
##  $ housing       : chr  "no" "no" "yes" "no" ...
##  $ loan          : chr  "no" "no" "no" "no" ...
##  $ contact       : chr  "telephone" "telephone" "telephone" "telephone" ...
##  $ month         : chr  "may" "may" "may" "may" ...
##  $ day_of_week   : chr  "mon" "mon" "mon" "mon" ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ y             : chr  "no" "no" "no" "no" ...
colSums(is.na(bankmart))
##            age            job        marital      education        default 
##              0              0              0              0              0 
##        housing           loan        contact          month    day_of_week 
##              0              0              0              0              0 
##       duration       campaign          pdays       previous       poutcome 
##              0              0              0              0              0 
##   emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
##              0              0              0              0              0 
##              y 
##              0
summary(bankmart)
##       age            job              marital           education        
##  Min.   :17.00   Length:41188       Length:41188       Length:41188      
##  1st Qu.:32.00   Class :character   Class :character   Class :character  
##  Median :38.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.02                                                           
##  3rd Qu.:47.00                                                           
##  Max.   :98.00                                                           
##    default            housing              loan             contact         
##  Length:41188       Length:41188       Length:41188       Length:41188      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     month           day_of_week           duration         campaign     
##  Length:41188       Length:41188       Min.   :   0.0   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.: 102.0   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median : 180.0   Median : 2.000  
##                                        Mean   : 258.3   Mean   : 2.568  
##                                        3rd Qu.: 319.0   3rd Qu.: 3.000  
##                                        Max.   :4918.0   Max.   :56.000  
##      pdays          previous       poutcome          emp.var.rate     
##  Min.   :  0.0   Min.   :0.000   Length:41188       Min.   :-3.40000  
##  1st Qu.:999.0   1st Qu.:0.000   Class :character   1st Qu.:-1.80000  
##  Median :999.0   Median :0.000   Mode  :character   Median : 1.10000  
##  Mean   :962.5   Mean   :0.173                      Mean   : 0.08189  
##  3rd Qu.:999.0   3rd Qu.:0.000                      3rd Qu.: 1.40000  
##  Max.   :999.0   Max.   :7.000                      Max.   : 1.40000  
##  cons.price.idx  cons.conf.idx     euribor3m      nr.employed  
##  Min.   :92.20   Min.   :-50.8   Min.   :0.634   Min.   :4964  
##  1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344   1st Qu.:5099  
##  Median :93.75   Median :-41.8   Median :4.857   Median :5191  
##  Mean   :93.58   Mean   :-40.5   Mean   :3.621   Mean   :5167  
##  3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961   3rd Qu.:5228  
##  Max.   :94.77   Max.   :-26.9   Max.   :5.045   Max.   :5228  
##       y            
##  Length:41188      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
ggcorr(bankmart, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(bankmart, label = TRUE, label_size = 2.9, hjust = 1,
## layout.exp = 2): data in column(s) 'job', 'marital', 'education', 'default',
## 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'y' are not
## numeric and were ignored

We check the unique values of ‘pdays’.

unique(bankmart$pdays)
##  [1] 999   6   4   3   5   1   0  10   7   8   9  11   2  12  13  14  15  16  21
## [20]  17  18  22  25  26  19  27  20

Below is a brief description of the features of this dataset.

  1. The output variable, or target, is in column ‘y’: a binary (‘yes’/‘no’) indicator of whether the client has subscribed to a term deposit.

  2. The column ‘campaign’ shows the number of contacts performed during this campaign and for this client (numeric, includes last contact).

  3. The column ‘pdays’ shows the number of days that passed after the client was last contacted in a previous campaign (numeric). The dataset’s documentation states that 999 means the client was not previously contacted (an older version of this dataset used -1 for the same purpose).

  4. The column ‘previous’ shows the number of contacts performed before this campaign and for this client (numeric).

  5. The column ‘poutcome’ shows the outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’).

  6. The column ‘contact’ shows the contact communication type (categorical: ‘cellular’,‘telephone’).

  7. The column ‘month’ shows the last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’).

  8. The column ‘day_of_week’ shows the last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ‘fri’).

  9. The column ‘duration’ shows the last contact duration, in seconds (numeric).

  10. The column ‘age’ (numeric).

  11. The column ‘job’ shows the type of job (categorical).

  12. The column ‘marital’ shows marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’).

  13. The column ‘education’ (categorical).

  14. The column ‘default’ shows whether the person has ever defaulted on credit facility (categorical:‘no’, ‘yes’, ‘unknown’)

  15. The column ‘housing’ shows whether the person has a housing loan (categorical:‘no’, ‘yes’, ‘unknown’)

  16. The column ‘loan’ shows whether the person has a personal loan (categorical:‘no’, ‘yes’, ‘unknown’)

  17. The remaining columns, ‘emp.var.rate’, ‘cons.price.idx’, ‘cons.conf.idx’, ‘euribor3m’, and ‘nr.employed’, are macroeconomic indicators (all numeric): the employment variation rate, the consumer price index, the consumer confidence index, the 3-month Euribor rate, and the number of employees.
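The inspectdf package loaded earlier can give a quick visual overview of the categorical columns; a sketch using its inspect_cat() and show_plot() helpers:

# frequency breakdown of every categorical (character/factor) column
inspect_cat(bankmart) %>% show_plot()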

Data Wrangling

We check for missing values.

colSums(is.na(bankmart))
##            age            job        marital      education        default 
##              0              0              0              0              0 
##        housing           loan        contact          month    day_of_week 
##              0              0              0              0              0 
##       duration       campaign          pdays       previous       poutcome 
##              0              0              0              0              0 
##   emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
##              0              0              0              0              0 
##              y 
##              0

The data types of the dataset require extensive changes, namely converting “chr” columns to “factor”.

bankmart$job <- as.factor(bankmart$job)
bankmart$marital <- as.factor(bankmart$marital)
bankmart$education<- as.factor(bankmart$education)
bankmart$default <- as.factor(bankmart$default)
bankmart$housing <- as.factor(bankmart$housing)
bankmart$loan <- as.factor(bankmart$loan)
bankmart$contact<- as.factor(bankmart$contact)
bankmart$month<- as.factor(bankmart$month)
bankmart$day_of_week <- as.factor(bankmart$day_of_week)
bankmart$poutcome <- as.factor(bankmart$poutcome)
bankmart$y <- as.factor(bankmart$y)
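Since the tidyverse is already loaded, the same conversion can be written in one step; a sketch with an identical result:

# convert every remaining character column to a factor at once
bankmart <- bankmart %>%
  mutate(across(where(is.character), as.factor))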

We change any 999 value found in ‘pdays’ to 0. (Note that this merges the ‘not previously contacted’ code with genuine 0-day values, which also occur in the data.)

bankmart$pdays[bankmart$pdays == 999] <- 0

unique(bankmart$pdays)
##  [1]  0  6  4  3  5  1 10  7  8  9 11  2 12 13 14 15 16 21 17 18 22 25 26 19 27
## [26] 20

If the ‘duration’ value is 0, the call never actually took place, so the outcome is necessarily ‘no’; keeping these rows would leak the target into the predictive model. We will therefore delete rows where ‘duration’ equals 0.
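A quick count of the affected rows (the before-and-after row counts in this report, 41188 versus 41184, imply there are 4 of them):

sum(bankmart$duration == 0)  # number of zero-duration calls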

bankmart<-bankmart[bankmart$duration != 0, ]

We also rename column “y” to “Sale”:

colnames(bankmart)[21] <- "Sale"
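An equivalent, name-based alternative with dplyr (a sketch; it avoids relying on the column position):

bankmart <- bankmart %>% rename(Sale = y)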

We check that changes made have taken effect.

summary(bankmart)
##       age                 job            marital     
##  Min.   :17.00   admin.     :10421   divorced: 4611  
##  1st Qu.:32.00   blue-collar: 9252   married :24925  
##  Median :38.00   technician : 6743   single  :11568  
##  Mean   :40.02   services   : 3969   unknown :   80  
##  3rd Qu.:47.00   management : 2923                   
##  Max.   :98.00   retired    : 1720                   
##                  (Other)    : 6156                   
##                education        default         housing           loan      
##  university.degree  :12167   no     :32584   no     :18621   no     :33946  
##  high.school        : 9513   unknown: 8597   unknown:  990   unknown:  990  
##  basic.9y           : 6044   yes    :    3   yes    :21573   yes    : 6248  
##  professional.course: 5243                                                  
##  basic.4y           : 4176                                                  
##  basic.6y           : 2292                                                  
##  (Other)            : 1749                                                  
##       contact          month       day_of_week    duration     
##  cellular :26141   may    :13767   fri:7826    Min.   :   1.0  
##  telephone:15043   jul    : 7174   mon:8513    1st Qu.: 102.0  
##                    aug    : 6177   thu:8623    Median : 180.0  
##                    jun    : 5318   tue:8088    Mean   : 258.3  
##                    nov    : 4101   wed:8134    3rd Qu.: 319.0  
##                    apr    : 2631               Max.   :4918.0  
##                    (Other): 2016                               
##     campaign          pdays            previous            poutcome    
##  Min.   : 1.000   Min.   : 0.0000   Min.   :0.000   failure    : 4252  
##  1st Qu.: 1.000   1st Qu.: 0.0000   1st Qu.:0.000   nonexistent:35559  
##  Median : 2.000   Median : 0.0000   Median :0.000   success    : 1373  
##  Mean   : 2.567   Mean   : 0.2213   Mean   :0.173                      
##  3rd Qu.: 3.000   3rd Qu.: 0.0000   3rd Qu.:0.000                      
##  Max.   :56.000   Max.   :27.0000   Max.   :7.000                      
##                                                                        
##   emp.var.rate      cons.price.idx  cons.conf.idx     euribor3m    
##  Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   Min.   :0.634  
##  1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344  
##  Median : 1.10000   Median :93.75   Median :-41.8   Median :4.857  
##  Mean   : 0.08192   Mean   :93.58   Mean   :-40.5   Mean   :3.621  
##  3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961  
##  Max.   : 1.40000   Max.   :94.77   Max.   :-26.9   Max.   :5.045  
##                                                                    
##   nr.employed    Sale      
##  Min.   :4964   no :36544  
##  1st Qu.:5099   yes: 4640  
##  Median :5191              
##  Mean   :5167              
##  3rd Qu.:5228              
##  Max.   :5228              
## 
str(bankmart)
## 'data.frame':    41184 obs. of  21 variables:
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
##  $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
##  $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
##  $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
##  $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
##  $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
##  $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
##  $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ Sale          : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

We check for correlation.

ggcorr(bankmart, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(bankmart, label = TRUE, label_size = 2.9, hjust = 1,
## layout.exp = 2): data in column(s) 'job', 'marital', 'education', 'default',
## 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'Sale' are not
## numeric and were ignored

The target variable does not appear in the plot because it is a factor. The plot does show that some predictor variables are highly correlated with one another, especially within the macroeconomic group of predictors: ‘nr.employed’, ‘euribor3m’, and ‘cons.price.idx’. This is an early warning that the data may be ill-suited to models that require predictors to be independent of each other.
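To put numbers on these relationships, the pairwise correlations among the numeric predictors can be printed directly; a quick sketch:

# correlation matrix of the numeric columns only, rounded for readability
bankmart %>%
  select(where(is.numeric)) %>%
  cor() %>%
  round(2)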

Data : Quick Summary

We make the following quick observations:

  1. Mean age of respondents is 40.

  2. A large proportion of the respondents work in administrative, blue-collar, and technician jobs.

  3. More than half of the respondents are married.

  4. More than half of the respondents have a university or high school education and 5243 of them have a professional qualification.

  5. Calls to the respondents were made every weekday (none on weekends), in roughly equal daily volumes.

  6. The mean number of calls made per respondent was about 2.6, and the maximum was 56.

  7. The mean duration of calls was 258 seconds.

  8. Most respondents had not been contacted in a previous campaign: the median of ‘previous’ is 0, with a maximum of 7 prior contacts.

  9. The remaining data concern macroeconomic indicators over the period of data collection. The means of these indicators are a 3-month Euribor rate of 3.621, a consumer price index of 93.58, a consumer confidence index of -40.5, and an employment variation rate of 0.082.

  10. More than 75% of respondents have never defaulted on a loan, while the default status of about 21% (8597) is unknown. Only 3 of the respondents have defaulted on loans.

  11. More than half of the respondents already have a housing loan (21573), while a little under half (18621) have no housing loan.

  12. More than 75% of the respondents (33946) have no personal loan, 6248 already have a personal loan, and the personal-loan status of 990 is unknown.

  13. About a third of the calls (13767) were made in the month of May.

Modelling

Data-Splitting

We split the dataset into an 80% “train” set and a 20% “test” set.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

samplesize <- round(0.8 * nrow(bankmart), 0)
index <- sample(seq_len(nrow(bankmart)), size = samplesize)

bankmart_train <- bankmart[index, ]
bankmart_test <- bankmart[-index, ]
prop.table(table(bankmart$Sale))
## 
##        no       yes 
## 0.8873349 0.1126651
barplot(prop.table(table(bankmart$Sale)),
        col = rainbow(2),
        ylim = c(0, 1.0),
        main = "Class Distribution")

Based on the plot, it is clearly evident that about 89% of the data fall in one class (“no”) and the remaining 11% in the other (“yes”).

table(bankmart$Sale)
## 
##    no   yes 
## 36544  4640

If we were to build a model on this dataset as-is, it would predict respondents who do not subscribe to the term deposit product far more accurately than respondents who do subscribe. For instance, a trivial model that always predicts “no” would already score 36544/41184 ≈ 88.7% accuracy while never identifying a single subscriber.

Since class “yes” contains only 4640 observations, we need a comparable number of observations from class “no”. We use the downSample() function from caret to obtain a more balanced distribution for the “train” set.

library(caret)
set.seed(1234)

bankmart_train_down<-downSample(x = bankmart_train %>% select(-Sale),
                         y = bankmart_train$Sale,
                         yname = "Sale")


table(bankmart_train_down$Sale)
## 
##   no  yes 
## 3758 3758

Now the size of each class is 3758.
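caret also provides upSample(), which balances the classes by replicating minority-class rows instead of discarding majority-class rows; a sketch of that alternative (not used further here):

set.seed(1234)
# up-sample: resample "yes" rows with replacement until the classes match
bankmart_train_up <- upSample(x = bankmart_train %>% select(-Sale),
                              y = bankmart_train$Sale,
                              yname = "Sale")
table(bankmart_train_up$Sale)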

Model : Naive Bayes

Naive Bayes is a probabilistic classifier: it applies Bayes’ theorem under the “naive” assumption that the predictors are conditionally independent given the class.
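Concretely, for a class C (here “yes” or “no”) and predictors x1, …, xp, the classifier scores each class as

                            P(C | x1, …, xp) ∝ P(C) · P(x1 | C) · … · P(xp | C)

and predicts the class with the higher score. The laplace = 1 argument used below applies Laplace smoothing, adding one to each frequency count so that factor levels unseen during training do not force a zero probability.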

Naive Bayes Model 1

We build the model using all predictors.

library(e1071)

model_naive_1 <- naiveBayes(Sale ~ ., data = bankmart_train_down, laplace = 1)

We generate predictions using the predictors in the “bankmart_test” set.

pred_naive_1<- predict(model_naive_1, newdata = bankmart_test[,-1])
## Warning in predict.naiveBayes(model_naive_1, newdata = bankmart_test[, -1]):
## Type mismatch between training and new data for variable 'age'. Did you use
## factors with numeric labels for training, and numeric values for new data?

Naive Bayes Model 2

We build the model using personal data and credit history/status data: ‘age’, ‘job’, ‘marital’, ‘education’, ‘default’, ‘housing’, and ‘loan’ predictors.

model_naive_2 <- naiveBayes(Sale ~ age + job + marital + education + default + housing +loan, data = bankmart_train_down, laplace = 1)
pred_naive_2<- predict(model_naive_2, newdata = bankmart_test[,-1])
## Warning in predict.naiveBayes(model_naive_2, newdata = bankmart_test[, -1]):
## Type mismatch between training and new data for variable 'age'. Did you use
## factors with numeric labels for training, and numeric values for new data?

Naive Bayes Model 3

We build the model using descriptors of communication with the respondent: ‘duration’, ‘campaign’, ‘pdays’, ‘previous’, and ‘poutcome’ predictors.

model_naive_3 <- naiveBayes(Sale ~ duration + campaign + pdays + previous + poutcome, data = bankmart_train_down, laplace = 1)
pred_naive_3<- predict(model_naive_3, newdata = bankmart_test[,-1])

Naive Bayes Model 4

We build the model using the macroeconomic indicators: ‘emp.var.rate’, ‘cons.price.idx’, ‘cons.conf.idx’, ‘euribor3m’, and ‘nr.employed’ predictors.

model_naive_4 <- naiveBayes(Sale ~ emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m + nr.employed , data = bankmart_train_down, laplace = 1)
pred_naive_4 <- predict(model_naive_4, newdata = bankmart_test[,-1])

Evaluation : Naive Bayes

Naive Bayes Model 1

We display the results of predictions.

(conf_matrix_naive <- table(pred_naive_1, bankmart_test$Sale))
##             
## pred_naive_1   no  yes
##          no  5949  223
##          yes 1406  659
confusionMatrix(pred_naive_1, reference = bankmart_test$Sale, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  5949  223
##        yes 1406  659
##                                              
##                Accuracy : 0.8022             
##                  95% CI : (0.7935, 0.8108)   
##     No Information Rate : 0.8929             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.3496             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.7472             
##             Specificity : 0.8088             
##          Pos Pred Value : 0.3191             
##          Neg Pred Value : 0.9639             
##              Prevalence : 0.1071             
##          Detection Rate : 0.0800             
##    Detection Prevalence : 0.2507             
##       Balanced Accuracy : 0.7780             
##                                              
##        'Positive' Class : yes                
## 

Naive Bayes Model 2

(conf_matrix_naive <- table(pred_naive_2, bankmart_test$Sale))
##             
## pred_naive_2   no  yes
##          no  3878  298
##          yes 3477  584
confusionMatrix(pred_naive_2, reference = bankmart_test$Sale, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  3878  298
##        yes 3477  584
##                                              
##                Accuracy : 0.5417             
##                  95% CI : (0.5309, 0.5525)   
##     No Information Rate : 0.8929             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.0732             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.6621             
##             Specificity : 0.5273             
##          Pos Pred Value : 0.1438             
##          Neg Pred Value : 0.9286             
##              Prevalence : 0.1071             
##          Detection Rate : 0.0709             
##    Detection Prevalence : 0.4930             
##       Balanced Accuracy : 0.5947             
##                                              
##        'Positive' Class : yes                
## 

Naive Bayes Model 3

(conf_matrix_naive <- table(pred_naive_3, bankmart_test$Sale))
##             
## pred_naive_3   no  yes
##          no  6874  445
##          yes  481  437
confusionMatrix(pred_naive_3, reference = bankmart_test$Sale, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6874  445
##        yes  481  437
##                                           
##                Accuracy : 0.8876          
##                  95% CI : (0.8806, 0.8943)
##     No Information Rate : 0.8929          
##     P-Value [Acc > NIR] : 0.9428          
##                                           
##                   Kappa : 0.4225          
##                                           
##  Mcnemar's Test P-Value : 0.2501          
##                                           
##             Sensitivity : 0.49546         
##             Specificity : 0.93460         
##          Pos Pred Value : 0.47603         
##          Neg Pred Value : 0.93920         
##              Prevalence : 0.10708         
##          Detection Rate : 0.05305         
##    Detection Prevalence : 0.11145         
##       Balanced Accuracy : 0.71503         
##                                           
##        'Positive' Class : yes             
## 

Naive Bayes Model 4

(conf_matrix_naive <- table(pred_naive_4, bankmart_test$Sale))
##             
## pred_naive_4   no  yes
##          no  5286  260
##          yes 2069  622
confusionMatrix(pred_naive_4, reference = bankmart_test$Sale, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  5286  260
##        yes 2069  622
##                                              
##                Accuracy : 0.7173             
##                  95% CI : (0.7074, 0.727)    
##     No Information Rate : 0.8929             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.2228             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.70522            
##             Specificity : 0.71869            
##          Pos Pred Value : 0.23114            
##          Neg Pred Value : 0.95312            
##              Prevalence : 0.10708            
##          Detection Rate : 0.07551            
##    Detection Prevalence : 0.32670            
##       Balanced Accuracy : 0.71196            
##                                              
##        'Positive' Class : yes                
## 

Naive Bayes : Summary

Our best Naive Bayes model is “model_naive_3”, which gives a prediction accuracy of 89% with a 95% confidence interval between 88% and 89%. “model_naive_3” uses the predictors describing contact or interaction with respondents. Multicollinearity concerns are also reduced in this model, since it excludes the macroeconomic predictors that are highly correlated with each other.

Decision Tree

Decision Tree modelling is a predictive modelling method that can be used for both classification and regression; here the outcome (dependent) variable is a binary categorical variable, while the predictor (independent) variables can be continuous or categorical. A decision tree does not assume linear relationships between the predictors and the outcome, nor any particular distribution of the predictors.

The main benefits of decision trees are that they require minimal data preparation, do not require feature scaling, can often handle missing values, and train quickly.

The main issues with decision trees are overfitting (performing well on training data but poorly on unseen data) and sensitivity to small amounts of “noisy” data, which can spawn additional nodes and make the tree highly complex; they can also become unwieldy on very large datasets.

knitr::include_graphics('DT_Example.jpg')

To begin building this model we confirm that relevant “character” type predictors and targets are in “factor” type form.

str(bankmart_train_down)
## 'data.frame':    7516 obs. of  21 variables:
##  $ age           : int  27 30 58 53 45 37 39 50 53 36 ...
##  $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 6 11 2 10 10 3 6 10 ...
##  $ marital       : Factor w/ 4 levels "divorced","married",..: 3 2 2 2 2 3 3 2 1 2 ...
##  $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 2 4 3 3 3 4 6 3 6 7 ...
##  $ default       : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 2 2 2 1 1 1 ...
##  $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 3 3 1 2 ...
##  $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ contact       : Factor w/ 2 levels "cellular","telephone": 2 1 2 1 2 2 1 1 1 1 ...
##  $ month         : Factor w/ 10 levels "apr","aug","dec",..: 4 7 5 7 7 5 7 4 2 2 ...
##  $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 4 4 2 4 3 4 1 3 2 1 ...
##  $ duration      : int  68 414 198 205 525 548 264 129 119 248 ...
##  $ campaign      : int  3 1 1 2 1 3 1 1 3 1 ...
##  $ pdays         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous      : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 1 2 2 2 2 2 2 ...
##  $ emp.var.rate  : num  1.4 -1.8 1.4 -1.8 1.1 1.4 -1.8 1.4 1.4 1.4 ...
##  $ cons.price.idx: num  93.9 92.9 94.5 92.9 94 ...
##  $ cons.conf.idx : num  -42.7 -46.2 -41.8 -46.2 -36.4 -41.8 -46.2 -42.7 -36.1 -36.1 ...
##  $ euribor3m     : num  4.96 1.34 4.87 1.29 4.86 ...
##  $ nr.employed   : num  5228 5099 5228 5099 5191 ...
##  $ Sale          : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Decision Tree : Evaluation

library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
set.seed(123)

bankmart_train_dt <- ctree(formula = Sale ~ euribor3m + housing + loan + duration + campaign + pdays + previous + poutcome, data = bankmart_train_down)
plot(bankmart_train_dt, type = "simple")

pred_dt <- predict(bankmart_train_dt, newdata =bankmart_test, type="response")
confusionMatrix(pred_dt, reference = bankmart_test$Sale, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6075   69
##        yes 1280  813
##                                              
##                Accuracy : 0.8362             
##                  95% CI : (0.8281, 0.8442)   
##     No Information Rate : 0.8929             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.4661             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.9218             
##             Specificity : 0.8260             
##          Pos Pred Value : 0.3884             
##          Neg Pred Value : 0.9888             
##              Prevalence : 0.1071             
##          Detection Rate : 0.0987             
##    Detection Prevalence : 0.2541             
##       Balanced Accuracy : 0.8739             
##                                              
##        'Positive' Class : yes                
## 

Decision Tree : Summary

Our Decision Tree model achieved an accuracy of about 84% with a 95% Confidence Interval between 83% and 84%, and a Sensitivity of 92%. The Decision Tree model used the predictors “euribor3m”, “housing”, “loan”, “duration”, “campaign”, “pdays”, “previous”, and “poutcome”.

Random Forest

Random forest builds many decision trees, each trained on a random subset of the features and a bootstrap sample of the data, and combines their predictions. This is “bagging” (bootstrap aggregation): subsets of the training data are created through random sampling with replacement, a decision tree is fitted to each subset, and the final prediction is produced by a majority vote (for classification) or an average (for regression) across the trees. This counters the overfitting problem of a single decision tree.
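As a quick illustration of the idea, a direct, untuned fit on the balanced training set might look like this (a minimal sketch; the tuned, cross-validated model used in this report is built below):

library(randomForest)
set.seed(1234)
# 300 bagged trees; mtry defaults to sqrt(#predictors) for classification
rf_sketch <- randomForest(Sale ~ ., data = bankmart_train_down,
                          ntree = 300, importance = TRUE)
rf_sketch  # prints the OOB error estimate and confusion matrix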

library(randomForest)

We use mtry = 10 (the number of variables randomly tried at each split), supplied via a one-point tuning grid.

# Define the control

trControl <- trainControl(method = "cv",
    number = 10,
    search = "grid")

tuneGrid <- expand.grid(.mtry = c(10))

We set ntree = 300 with “nodesize” and “maxnodes” left at their default values, using the “euribor3m”, “age”, “education”, “job”, “default”, “housing”, “loan”, “previous”, “pdays”, “campaign”, and “duration” predictors. The following code has been commented out to save considerable time when knitting the document.

# set.seed(1234)
# 
# fit_rf <- train(Sale~euribor3m + age + education + job + default + housing + loan +                 previous + pdays + campaign + duration,
#     bankmart_train,
#     method = "rf",
#     metric = "Accuracy",
#     tuneGrid = tuneGrid,
#     trControl = trControl,
#     importance = TRUE,
#     # nodesize = 10,
#     ntree = 300,
#     # maxnodes = 10
#     )

We save the newly built model so it can be reloaded quickly in future runs.

# saving the model
# saveRDS(fit_rf, file = "fit_rf.rds")
fit_rf<-readRDS("fit_rf.rds")

Next we use the model to predict the target using the “test” set of predictors.

prediction <-predict(fit_rf, bankmart_test)

We then display the confusion matrix indicators of the predictions.

confusionMatrix(prediction, bankmart_test$Sale, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6841  193
##        yes  514  689
##                                                
##                Accuracy : 0.9142               
##                  95% CI : (0.9079, 0.9201)     
##     No Information Rate : 0.8929               
##     P-Value [Acc > NIR] : 0.00000000007224     
##                                                
##                   Kappa : 0.6131               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.78118              
##             Specificity : 0.93012              
##          Pos Pred Value : 0.57273              
##          Neg Pred Value : 0.97256              
##              Prevalence : 0.10708              
##          Detection Rate : 0.08365              
##    Detection Prevalence : 0.14605              
##       Balanced Accuracy : 0.85565              
##                                                
##        'Positive' Class : yes                  
## 

With this random forest model we have achieved an accuracy of 91% with a 95% Confidence Interval between 91% and 92%. The sensitivity measure, however, is comparatively low at 78%.

We always prefer that Sensitivity, or Recall,

                            TP / (TP + FN)

approach 1.0, and that Specificity,

                            TN / (TN + FP)

also approach 1.0; equivalently, the False Positive Rate, FP / (FP + TN), should approach zero.
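As a sanity check, these rates can be recomputed directly from the counts in the confusionMatrix() output above:

# counts taken from the random forest confusion matrix above
TP <- 689; FN <- 193; TN <- 6841; FP <- 514
TP / (TP + FN)  # Sensitivity / Recall: ~0.781
TN / (TN + FP)  # Specificity:          ~0.930
FP / (FP + TN)  # False Positive Rate:  ~0.070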

The reason is that these rates determine how well our model makes both “yes” and “no” predictions. We can check the parameters of the previous random forest model building process as follows.

fit_rf$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 300, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 8.81%
## Confusion matrix:
##        no  yes class.error
## no  28183 1083   0.0370054
## yes  1821 1863   0.4942997

We can see above that our OOB measure registers an error rate of 8.81%.

plot(fit_rf$finalModel)
legend("topright", colnames(fit_rf$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

The plot above shows that the error rate for predicting a “yes” for target “Sale” stagnates at about 0.5 once the forest grows beyond roughly 10 trees.

plot(varImp(fit_rf))

The above plot shows that ‘duration’ is the most important predictor variable followed by ‘euribor3m’.

We now proceed to plot the ROC curve and compute the AUC. The ROC curve represents the relationship between Sensitivity (the True Positive Rate, TPR) and the False Positive Rate (FPR) at each classification threshold. Ideally, a good model has a high true positive rate and a low false positive rate.

AUC is the area under an ROC curve. A high AUC signifies a model that has a high TPR/Recall and a low FPR. An AUC close to 1 means that the model can distinguish (‘yes’ and ‘no’) classes very well, while an AUC of 0.5 means that the model is no better at predicting classes than by a random selection process.

prob_test <- predict(fit_rf,newdata=bankmart_test, type = "prob")
library(ROCR)

pred_roc <-  prediction(prob_test[,"yes"],bankmart_test$Sale)
perf <- performance(pred_roc, "tpr", "fpr")
perf
## A performance instance
##   'False positive rate' vs. 'True positive rate' (alpha: 'Cutoff')
##   with 295 data points
plot(perf)

auc <-performance(pred_roc, measure = "auc")
print(auc@y.values)
## [[1]]
## [1] 0.9544342

Conclusion

The best Naive Bayes model (“model_naive_3”) used the data describing contact with clients and yielded an accuracy of 89% with a sensitivity of about 50% and a specificity of 93%.

The decision tree model yielded an accuracy of about 84% with a sensitivity of 92% and a specificity of 83%.

The random forest model yielded an accuracy of 91% with a sensitivity of 78% and a specificity of 93%. Its class errors are 3.7% for classifying a ‘no’ and 49% for classifying a ‘yes’; its OOB error estimate is 8.8%, and its AUC is 0.95.