Introduction
Goal:
This dataset is used to classify whether a customer will subscribe to a term deposit based on a number of attributes. It was built to help the marketing team of a Portuguese banking institution plan its marketing campaigns, which were often conducted via phone calls. The predicted output gives them a fair idea of whether a client will subscribe to a term deposit or not.
What we will do:
We will use Naive Bayes, Decision Tree and Random Forest models on the Bank Marketing data from the UCI ML Repository. We want to understand the relationships among the variables, especially between the target (term deposit subscription) and the other variables. We also want to predict whether a client will subscribe to a term deposit or not.
Columns description:
- age: clients’ age during the data collection
- job: type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
- marital: marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
- education: education level (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
- default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
- housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
- loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
other attributes:
- campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- pdays: number of days that passed after the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
- previous: number of contacts performed before this campaign and for this client (numeric)
- poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)
Note: in bank-full.csv some coded values differ slightly from the descriptions above: education uses ‘primary’/‘secondary’/‘tertiary’/‘unknown’, poutcome includes ‘other’ and ‘unknown’, and pdays uses -1 (not 999) for clients who were not previously contacted, as can be seen in the data below.
Output variable (desired target):
y: has the client subscribed to a term deposit? (binary: ‘yes’,‘no’)
Import Libraries
library(caret)        # downSample(), train(), confusionMatrix(), varImp()
library(dplyr)        # data wrangling
library(e1071)        # naiveBayes()
library(ggplot2)      # plots in the EDA section (also attached by caret)
library(partykit)     # ctree()
library(randomForest)
library(rsample)      # initial_split()
Read Data
bank <- read.csv("bank-full.csv", sep = ";")
head(bank)
## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
Data Wrangling
Check the internal structure of the data
This is to check whether each column has a suitable data type.
str(bank)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
Insights: some columns do not have the most appropriate data type, such as job, marital, education, default, housing, loan, contact, poutcome and y; these should be factors.
Change data types of columns
We should not change day to a factor because the decision tree is not able to process a categorical column with more than 30 classes, and day takes values from 1 to 31.
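As a quick check (a minimal sketch using dplyr's n_distinct(), which is already loaded), we can confirm how many distinct values day takes:
# number of levels that would be created if day were converted to a factor
n_distinct(bank$day)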
# convert the character columns to factors
bank <- bank %>%
  mutate_at(vars(job, marital, education, default, housing, loan, contact, month, poutcome, y), as.factor)
bank$duration <- as.integer(bank$duration)
Remove previous
bank <- bank %>% select(-previous)
Check for null values
bank %>%
is.na() %>%
colSums()
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays poutcome y
## 0 0 0 0 0 0 0 0
Insights: there are no null values in the data.
Scale the numeric columns
# Separate the numeric columns from the original data
bank_numeric <- bank %>% select_if(is.numeric)
# Drop the columns that will be replaced by their scaled versions
# (day stays in both data frames, so cbind() below also adds a scaled copy named day.1)
bank_non_numeric <- bank %>% select(-c(age, balance, duration, campaign, pdays))
# Scale the numeric columns
bank_numeric_s <- scale(bank_numeric)
# Put them back together, then reorder so that y is the last column
# (the scaled pdays column, position 17, is not selected and is therefore dropped)
bank_clean <- cbind(bank_non_numeric, bank_numeric_s)
bank_clean <- bank_clean[, c(1:10, 12:16, 11)]
EDA
Check the summary of the data
summary(bank_clean)
## job marital education default housing
## blue-collar:9732 divorced: 5207 primary : 6851 no :44396 no :20081
## management :9458 married :27214 secondary:23202 yes: 815 yes:25130
## technician :7597 single :12790 tertiary :13301
## admin. :5171 unknown : 1857
## services :4154
## retired :2264
## (Other) :6835
## loan contact day month poutcome
## no :37967 cellular :29285 Min. : 1.00 may :13766 failure: 4901
## yes: 7244 telephone: 2906 1st Qu.: 8.00 jul : 6895 other : 1840
## unknown :13020 Median :16.00 aug : 6247 success: 1511
## Mean :15.81 jun : 5341 unknown:36959
## 3rd Qu.:21.00 nov : 3970
## Max. :31.00 apr : 2932
## (Other): 6060
## age balance day.1 duration
## Min. :-2.1600 Min. :-3.08111 Min. :-1.77909 Min. :-1.0025
## 1st Qu.:-0.7474 1st Qu.:-0.42377 1st Qu.:-0.93799 1st Qu.:-0.6025
## Median :-0.1823 Median :-0.30028 Median : 0.02326 Median :-0.3035
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.6652 3rd Qu.: 0.02159 3rd Qu.: 0.62404 3rd Qu.: 0.2362
## Max. : 5.0913 Max. :33.09441 Max. : 1.82561 Max. :18.0945
##
## campaign y
## Min. :-0.56934 no :39922
## 1st Qu.:-0.56934 yes: 5289
## Median :-0.24656
## Mean : 0.00000
## 3rd Qu.: 0.07623
## Max. :19.44343
##
Check the distribution of the jobs
job_distribution <- bank_clean %>%
group_by(job) %>%
summarise(job_count = n()) %>%
arrange(-job_count)
library(glue)
job_distribution_plot <- ggplot(data = job_distribution, aes(x = job_count,
y = reorder(job, job_count),
text = glue("No. of customers: {job_count}")
)) +
geom_col(aes(fill = job)) +
labs(title = "Job Distribution of Customers",
     x = "No. of Customers",
     y = "Job"
) +
theme_minimal() +
theme(legend.position = "none")
job_distribution_plot
The majority of the customers have jobs in the blue-collar, management and technician categories.
Check the distribution of the education
education_distribution <- bank_clean %>%
group_by(education) %>%
summarise(ed_count = n())
library(glue)
education_distribution_plot <- ggplot(data = education_distribution, aes(y = ed_count,
x = education,
text = glue("No. of customers: {ed_count}")
)) +
geom_col(fill = "firebrick") +
labs(title = "Education Distribution of Customers",
x = "Education Level",
y = ""
) +
theme_minimal() +
theme(legend.position = "none")
education_distribution_plot
A large portion of the customers have secondary or tertiary education as their highest level, which means most of them are fairly well educated.
Cross Validation
Check the proportion of the target data
prop.table(table(bank_clean$y))
##
## no yes
## 0.8830152 0.1169848
Insights: the data is imbalanced (roughly 88% no vs 12% yes), so downsampling is needed.
Split the data into 75% train and 25% test
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(nrow(bank_clean), nrow(bank_clean)*0.75)
data_train <- bank_clean[index,]
data_test <- bank_clean[-index,]
Target downsampling
Downsampling reduces the number of majority-class observations until they are balanced with the minority class. This is usually done on a large dataset.
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)
data_train <- downSample(x = data_train %>% select(-y),
y = data_train$y,
yname = "y")
prop.table(table(data_train$y))
##
## no yes
## 0.5 0.5
Modelling
In this phase, we fit each of the models to the training data.
Naive Bayes
bank_naive <- naiveBayes(x = data_train %>% select(-y),
y = data_train$y)
bank_naive
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = data_train %>% select(-y), y = data_train$y)
##
## A-priori probabilities:
## data_train$y
## no yes
## 0.5 0.5
##
## Conditional probabilities:
## job
## data_train$y admin. blue-collar entrepreneur housemaid management
## no 0.108958228 0.229994967 0.037242073 0.032209361 0.199547056
## yes 0.123553095 0.137393055 0.022647207 0.020130851 0.242576749
## job
## data_train$y retired self-employed services student technician
## no 0.039255159 0.034977353 0.097131354 0.015098138 0.170105687
## yes 0.100150981 0.035228988 0.068444892 0.048817313 0.155762456
## job
## data_train$y unemployed unknown
## no 0.027428284 0.008052340
## yes 0.039255159 0.006039255
##
## marital
## data_train$y divorced married single
## no 0.1149975 0.6114746 0.2735279
## yes 0.1185204 0.5266734 0.3548062
##
## education
## data_train$y primary secondary tertiary unknown
## no 0.15601409 0.52063412 0.28309009 0.04026170
## yes 0.11172622 0.46502265 0.37695018 0.04630096
##
## default
## data_train$y no yes
## no 0.97810770 0.02189230
## yes 0.98968294 0.01031706
##
## housing
## data_train$y no yes
## no 0.4272773 0.5727227
## yes 0.6356316 0.3643684
##
## loan
## data_train$y no yes
## no 0.84197282 0.15802718
## yes 0.90865627 0.09134373
##
## contact
## data_train$y cellular telephone unknown
## no 0.63839960 0.05913437 0.30246603
## yes 0.82637141 0.07196779 0.10166080
##
## day
## data_train$y [,1] [,2]
## no 15.84977 8.367192
## yes 15.09864 8.542359
##
## month
## data_train$y apr aug dec feb jan
## no 0.056114746 0.152994464 0.001761449 0.051836940 0.031454454
## yes 0.106441872 0.129843986 0.018117765 0.081529945 0.028183191
## month
## data_train$y jul jun mar may nov
## no 0.158782084 0.118268747 0.003522899 0.325868143 0.080523402
## yes 0.118772018 0.106945143 0.049068948 0.171867136 0.078258681
## month
## data_train$y oct sep
## no 0.010568697 0.008303976
## yes 0.059134373 0.051836940
##
## poutcome
## data_train$y failure other success unknown
## no 0.10442879 0.04101661 0.01333669 0.84121792
## yes 0.11499748 0.05712129 0.18595873 0.64192250
##
## age
## data_train$y [,1] [,2]
## no 0.002831838 0.9493634
## yes 0.078165447 1.2821592
##
## balance
## data_train$y [,1] [,2]
## no -0.008521819 1.066287
## yes 0.140457353 1.101335
##
## day.1
## data_train$y [,1] [,2]
## no 0.005209355 1.005373
## yes -0.085044116 1.026420
##
## duration
## data_train$y [,1] [,2]
## no -0.1484305 0.7969071
## yes 1.0630080 1.5063952
##
## campaign
## data_train$y [,1] [,2]
## no 0.04739435 1.0973483
## yes -0.19571100 0.6306424
Decision Tree
bank_dt <- ctree(formula = y ~.,
data = data_train)
bank_dt
##
## Model formula:
## y ~ job + marital + education + default + housing + loan + contact +
## day + month + poutcome + age + balance + day.1 + duration +
## campaign
##
## Fitted party:
## [1] root
## | [2] duration <= -0.19867
## | | [3] month in apr, dec, feb, mar, oct, sep
## | | | [4] duration <= -0.65299
## | | | | [5] duration <= -0.76948
## | | | | | [6] age <= 1.4186: no (n = 67, err = 0.0%)
## | | | | | [7] age > 1.4186: no (n = 7, err = 14.3%)
## | | | | [8] duration > -0.76948: no (n = 83, err = 34.9%)
## | | | [9] duration > -0.65299
## | | | | [10] poutcome in failure, other, unknown
## | | | | | [11] housing in no
## | | | | | | [12] month in apr, dec, mar, oct: yes (n = 217, err = 23.0%)
## | | | | | | [13] month in feb, sep
## | | | | | | | [14] day.1 <= -0.93799: no (n = 48, err = 25.0%)
## | | | | | | | [15] day.1 > -0.93799: yes (n = 53, err = 20.8%)
## | | | | | [16] housing in yes
## | | | | | | [17] month in apr, dec, feb: no (n = 119, err = 25.2%)
## | | | | | | [18] month in mar, oct, sep: yes (n = 48, err = 20.8%)
## | | | | [19] poutcome in success
## | | | | | [20] campaign <= -0.24656: yes (n = 78, err = 5.1%)
## | | | | | [21] campaign > -0.24656: yes (n = 15, err = 33.3%)
## | | [22] month in aug, jan, jul, jun, may, nov
## | | | [23] poutcome in failure, other, unknown
## | | | | [24] duration <= -0.50932
## | | | | | [25] job in admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, technician, unknown
## | | | | | | [26] day.1 <= -1.41862: no (n = 85, err = 14.1%)
## | | | | | | [27] day.1 > -1.41862
## | | | | | | | [28] poutcome in failure, other
## | | | | | | | | [29] contact in cellular
## | | | | | | | | | [30] month in aug, jul, jun: no (n = 11, err = 27.3%)
## | | | | | | | | | [31] month in jan, may, nov: no (n = 116, err = 2.6%)
## | | | | | | | | [32] contact in telephone, unknown: no (n = 16, err = 18.8%)
## | | | | | | | [33] poutcome in unknown: no (n = 1025, err = 1.7%)
## | | | | | [34] job in student, unemployed: no (n = 53, err = 22.6%)
## | | | | [35] duration > -0.50932
## | | | | | [36] contact in cellular, telephone
## | | | | | | [37] month in aug, jan, jul, may, nov
## | | | | | | | [38] housing in no
## | | | | | | | | [39] poutcome in failure, other: yes (n = 55, err = 36.4%)
## | | | | | | | | [40] poutcome in unknown
## | | | | | | | | | [41] month in aug, jul
## | | | | | | | | | | [42] job in admin., blue-collar, management, self-employed, services, technician, unknown: no (n = 178, err = 7.3%)
## | | | | | | | | | | [43] job in entrepreneur, housemaid, retired, student, unemployed: no (n = 40, err = 42.5%)
## | | | | | | | | | [44] month in jan, may, nov
## | | | | | | | | | | [45] day <= 14: yes (n = 19, err = 15.8%)
## | | | | | | | | | | [46] day > 14
## | | | | | | | | | | | [47] month in jan, nov: no (n = 47, err = 12.8%)
## | | | | | | | | | | | [48] month in may: yes (n = 9, err = 33.3%)
## | | | | | | | [49] housing in yes
## | | | | | | | | [50] month in aug, jan, nov: no (n = 104, err = 19.2%)
## | | | | | | | | [51] month in jul, may
## | | | | | | | | | [52] balance <= 0.76877: no (n = 189, err = 3.7%)
## | | | | | | | | | [53] balance > 0.76877: no (n = 11, err = 27.3%)
## | | | | | | [54] month in jun: yes (n = 62, err = 29.0%)
## | | | | | [55] contact in unknown
## | | | | | | [56] month in jul, nov: no (n = 12, err = 41.7%)
## | | | | | | [57] month in jun, may: no (n = 300, err = 0.0%)
## | | | [58] poutcome in success
## | | | | [59] duration <= -0.48602: yes (n = 32, err = 40.6%)
## | | | | [60] duration > -0.48602: yes (n = 99, err = 12.1%)
## | [61] duration > -0.19867
## | | [62] duration <= 1.22642
## | | | [63] month in apr, dec, feb, mar, oct, sep
## | | | | [64] month in apr, feb
## | | | | | [65] housing in no
## | | | | | | [66] day <= 7: yes (n = 92, err = 42.4%)
## | | | | | | [67] day > 7
## | | | | | | | [68] poutcome in failure, other: yes (n = 34, err = 26.5%)
## | | | | | | | [69] poutcome in success, unknown: yes (n = 209, err = 6.7%)
## | | | | | [70] housing in yes
## | | | | | | [71] day <= 20
## | | | | | | | [72] month in apr: no (n = 94, err = 29.8%)
## | | | | | | | [73] month in feb
## | | | | | | | | [74] day.1 <= -1.29846: no (n = 32, err = 34.4%)
## | | | | | | | | [75] day.1 > -1.29846: yes (n = 19, err = 5.3%)
## | | | | | | [76] day > 20: yes (n = 37, err = 5.4%)
## | | | | [77] month in dec, mar, oct, sep: yes (n = 423, err = 4.0%)
## | | | [78] month in aug, jan, jul, jun, may, nov
## | | | | [79] poutcome in failure, other, unknown
## | | | | | [80] contact in cellular, telephone
## | | | | | | [81] duration <= 0.58183
## | | | | | | | [82] month in aug, jan, jul, may, nov
## | | | | | | | | [83] housing in no
## | | | | | | | | | [84] job in admin., management, retired, student, unemployed
## | | | | | | | | | | [85] day <= 27
## | | | | | | | | | | | [86] month in aug, jul: yes (n = 147, err = 43.5%)
## | | | | | | | | | | | [87] month in jan, may, nov: yes (n = 102, err = 20.6%)
## | | | | | | | | | | [88] day > 27: no (n = 26, err = 23.1%)
## | | | | | | | | | [89] job in blue-collar, entrepreneur, housemaid, self-employed, services, technician, unknown
## | | | | | | | | | | [90] day <= 4: yes (n = 20, err = 20.0%)
## | | | | | | | | | | [91] day > 4
## | | | | | | | | | | | [92] balance <= -0.35217
## | | | | | | | | | | | | [93] duration <= 0.36049
## | | | | | | | | | | | | | [94] month in aug, jan, jul, nov: no (n = 70, err = 2.9%)
## | | | | | | | | | | | | | [95] month in may: no (n = 8, err = 50.0%)
## | | | | | | | | | | | | [96] duration > 0.36049: yes (n = 15, err = 46.7%)
## | | | | | | | | | | | [97] balance > -0.35217: no (n = 140, err = 39.3%)
## | | | | | | | | [98] housing in yes
## | | | | | | | | | [99] month in aug
## | | | | | | | | | | [100] poutcome in failure, other: yes (n = 13, err = 0.0%)
## | | | | | | | | | | [101] poutcome in unknown: no (n = 28, err = 32.1%)
## | | | | | | | | | [102] month in jan, jul, may, nov
## | | | | | | | | | | [103] duration <= 0.34108: no (n = 257, err = 16.3%)
## | | | | | | | | | | [104] duration > 0.34108: no (n = 85, err = 32.9%)
## | | | | | | | [105] month in jun: yes (n = 92, err = 7.6%)
## | | | | | | [106] duration > 0.58183
## | | | | | | | [107] housing in no: yes (n = 259, err = 25.1%)
## | | | | | | | [108] housing in yes: yes (n = 184, err = 42.9%)
## | | | | | [109] contact in unknown
## | | | | | | [110] duration <= 0.66726
## | | | | | | | [111] month in jul, may: no (n = 242, err = 1.2%)
## | | | | | | | [112] month in jun, nov: no (n = 113, err = 8.8%)
## | | | | | | [113] duration > 0.66726: no (n = 139, err = 41.7%)
## | | | | [114] poutcome in success: yes (n = 232, err = 3.4%)
## | | [115] duration > 1.22642
## | | | [116] contact in cellular, telephone: yes (n = 1279, err = 9.7%)
## | | | [117] contact in unknown: yes (n = 359, err = 18.9%)
##
## Number of inner nodes: 58
## Number of terminal nodes: 59
Random Forest
Remove predictors that have near-zero variance (and are therefore less informative) to reduce computation time.
n0_var <- nearZeroVar(bank_clean)
bank_new <- bank_clean[,-n0_var]
Cross validation for Random Forest model
library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- initial_split(data = bank_new , prop = 0.75, strata = "y")
rf_train <- training(index)
rf_test <- testing(index)
Check the target proportion
prop.table(table(rf_train$y))
##
## no yes
## 0.883033 0.116967
Downsample the target
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)
rf_train <- downSample(x = rf_train %>% select(-y),
y = rf_train$y,
yname = "y")
prop.table(table(rf_train$y))
##
## no yes
## 0.5 0.5
# Training the random forest with 5-fold cross-validation repeated 3 times is time-consuming,
# so the code below is left commented out and the fitted model is loaded from disk instead.
#set.seed(417)
#ctrl <- trainControl(method = "repeatedcv",
# number = 5, # k-fold
# repeats = 3) # repetition
#bank_rf <- train(y ~ .,
# data = rf_train,
# method = "rf", # random forest
# trControl = ctrl)
#saveRDS(bank_rf, "bank_rf.RDS")
bank_rf <- readRDS("bank_rf.RDS")
library(randomForest)
bank_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 21
##
## OOB estimate of error rate: 14.12%
## Confusion matrix:
## no yes class.error
## no 3271 695 0.1752395
## yes 425 3541 0.1071609
Model Prediction and Evaluation
Naive Bayes
Predict
bank_naive_pred <- predict(object = bank_naive,
newdata = data_test,
type = "class")
Confusion matrix
confusionMatrix(data = bank_naive_pred,
reference = data_test$y) #data actual
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7861 210
## yes 2127 1105
##
## Accuracy : 0.7932
## 95% CI : (0.7857, 0.8007)
## No Information Rate : 0.8837
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3842
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7870
## Specificity : 0.8403
## Pos Pred Value : 0.9740
## Neg Pred Value : 0.3419
## Prevalence : 0.8837
## Detection Rate : 0.6955
## Detection Prevalence : 0.7141
## Balanced Accuracy : 0.8137
##
## 'Positive' Class : no
##
AUC and ROC
The ROC curve and AUC act as evaluation tools that provide another perspective on how well a model separates the two target classes.
# Take the prediction results in terms of probability
bank_naive_pred_prob <- predict(object = bank_naive,
newdata = data_test,
type = "raw")
head(bank_naive_pred_prob)
## no yes
## [1,] 0.9501237 0.04987630
## [2,] 0.9685047 0.03149529
## [3,] 0.9240637 0.07593627
## [4,] 0.9800660 0.01993405
## [5,] 0.8910018 0.10899816
## [6,] 0.9837691 0.01623089
Prepare a data frame for the ROC curve, taking 'yes' as the positive class.
# prepare predicted probabilities vs actual labels
data_roc <- data.frame(pred_prob = bank_naive_pred_prob[, 'yes'],
actual = ifelse(data_test$y == 'yes',1, 0))
head(data_roc, 10)
## pred_prob actual
## 1 0.04987630 0
## 2 0.03149529 0
## 3 0.07593627 0
## 4 0.01993405 0
## 5 0.10899816 0
## 6 0.01623089 0
## 7 0.06470933 0
## 8 0.04177653 0
## 9 0.01599570 0
## 10 0.05203860 0
Build the ROC curve by preparing a prediction() object
library(ROCR)
# prediction object
bank_naive_roc <- prediction(predictions = data_roc$pred_prob,
labels = data_roc$actual)
# ROC curve
plot(performance(bank_naive_roc,"tpr","fpr"))
The closer the curve bends toward the top-left corner (a right angle), the better the model separates the classes.
Area Under Curve (AUC)
AUC is the area under the ROC curve. The closer it is to 1, the better the model differentiates the two classes. To obtain the AUC value, we use measure = "auc" in performance() and extract y.values.
# AUC value
bank_naive_auc <- performance(bank_naive_roc, measure = "auc")
bank_naive_auc@y.values
## [[1]]
## [1] 0.8787196
AUC = 0.878, thus bank_naive is decent at differentiating the yes and no classes (i.e., whether the customer will subscribe to the term deposit), but it can still be improved.
Decision Tree
Predict
bank_dt_pred <- predict(object = bank_dt,
newdata = data_test,
type = "response")
Confusion matrix
confusionMatrix(data = bank_dt_pred,
reference = data_test$y) #data actual
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 8190 147
## yes 1798 1168
##
## Accuracy : 0.8279
## 95% CI : (0.8208, 0.8348)
## No Information Rate : 0.8837
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4583
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8200
## Specificity : 0.8882
## Pos Pred Value : 0.9824
## Neg Pred Value : 0.3938
## Prevalence : 0.8837
## Detection Rate : 0.7246
## Detection Prevalence : 0.7376
## Balanced Accuracy : 0.8541
##
## 'Positive' Class : no
##
Check whether the decision tree model overfits
bank_dt_pred_train <- predict(object = bank_dt,
newdata = data_train,
type = "response")
Confusion matrix
confusionMatrix(data = bank_dt_pred_train,
reference = data_train$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 3296 449
## yes 678 3525
##
## Accuracy : 0.8582
## 95% CI : (0.8503, 0.8658)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7164
##
## Mcnemar's Test P-Value : 1.109e-11
##
## Sensitivity : 0.8294
## Specificity : 0.8870
## Pos Pred Value : 0.8801
## Neg Pred Value : 0.8387
## Prevalence : 0.5000
## Detection Rate : 0.4147
## Detection Prevalence : 0.4712
## Balanced Accuracy : 0.8582
##
## 'Positive' Class : no
##
No, the decision tree does not appear to overfit, based on comparing its performance on data_train and data_test.
Random Forest
Predict
bank_rf_pred <- predict(object = bank_rf,
newdata = rf_test,
type = "raw")
Confusion matrix
confusionMatrix(data = bank_rf_pred,
reference = rf_test$y) #data actual
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 8283 125
## yes 1698 1198
##
## Accuracy : 0.8387
## 95% CI : (0.8318, 0.8455)
## No Information Rate : 0.883
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4852
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8299
## Specificity : 0.9055
## Pos Pred Value : 0.9851
## Neg Pred Value : 0.4137
## Prevalence : 0.8830
## Detection Rate : 0.7327
## Detection Prevalence : 0.7438
## Balanced Accuracy : 0.8677
##
## 'Positive' Class : no
##
Check some of the most important predictors
varImp(bank_rf)
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## duration 100.000
## balance 19.647
## age 18.239
## poutcomesuccess 15.380
## contactunknown 12.365
## day.1 12.043
## day 11.919
## housingyes 8.888
## campaign 7.215
## poutcomeunknown 5.097
## monthmar 3.798
## monthaug 3.281
## monthjul 3.120
## loanyes 2.937
## monthoct 2.859
## monthmay 2.850
## jobblue-collar 2.473
## monthnov 2.452
## monthjun 2.403
## maritalmarried 2.281
Comparing the Three Models
Since we are looking at how precisely each model identifies whether a customer will subscribe to the term deposit, we use precision (Pos Pred Value; note that the positive class in the confusion matrices above is no) as our metric. The sketch after this list shows how to pull these values programmatically:
- Naive Bayes Pos Pred Value: 97.40%
- Decision Tree Pos Pred Value: 98.24%
- Random Forest Pos Pred Value: 98.51%
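As a minimal sketch (assuming the prediction objects bank_naive_pred, bank_dt_pred, bank_rf_pred and the test sets from the sections above are still in memory; nb_cm, dt_cm and rf_cm are just illustrative names), the same values can be extracted from the confusionMatrix() results rather than read off the printed output:
# collect precision (Pos Pred Value, positive class = "no") for each model
nb_cm <- confusionMatrix(data = bank_naive_pred, reference = data_test$y)
dt_cm <- confusionMatrix(data = bank_dt_pred, reference = data_test$y)
rf_cm <- confusionMatrix(data = bank_rf_pred, reference = rf_test$y)
data.frame(model = c("Naive Bayes", "Decision Tree", "Random Forest"),
           pos_pred_value = c(nb_cm$byClass["Pos Pred Value"],
                              dt_cm$byClass["Pos Pred Value"],
                              rf_cm$byClass["Pos Pred Value"]))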
Conclusion
- The best model in this case is Random Forest, based on comparing the precision (Pos Pred Value).
- The 3 most important variables that affect whether a customer will subscribe to the bank term deposit are duration, balance and age.
- Some improvement could still be made by tuning parameters of the decision tree model, such as mincriterion, minsplit and minbucket; a sketch follows below.
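As a rough sketch of that last point, the ctree() parameters can be adjusted via partykit::ctree_control(); the values and the name bank_dt_tuned below are arbitrary illustrations, not tuned recommendations:
# hypothetical example values; they would still need to be tuned, e.g. via a grid search
bank_dt_tuned <- ctree(formula = y ~ .,
                       data = data_train,
                       control = ctree_control(mincriterion = 0.99, # stricter significance level for a split
                                               minsplit = 100,      # minimum observations needed to attempt a split
                                               minbucket = 50))     # minimum observations in a terminal node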
Citation
[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014