Telemarketing is nothing new, but it remains one of the most important ways to gain potential customers. Some companies focus on training their marketers in the hope that such an investment will pay off. However, even the best salesperson in the world fails sometimes. It is usually not the salesperson's fault; rather, the problem lies in the gap in understanding the client. Clients who share certain characteristics are more likely to be interested in an offer, while other groups with a common background will shy away from it. A clear example: people who work in hazardous environments are more likely to sign up for insurance than those who work in an office.
Inspired by the original paper “A Data-Driven Approach to Predict the Success of Bank Telemarketing” by S. Moro, P. Cortez and P. Rita (2014), five machine learning models were built to predict whether the client subscribed to a term deposit. Unlike the paper, however, the dataset was preprocessed in a simple way, the formula was selected differently, and the models were not heavily modified.
The dataset is based on the “Bank Marketing” UCI dataset. Nevertheless, it is not identical to the dataset used by S. Moro, P. Cortez and P. Rita, which was not yet enriched by the Banco de Portugal. According to the dataset's author, the addition of five new social and economic features/attributes significantly improved the probability of correct prediction in comparison with the original.
Number of observations: 41,188. Number of features: 20, plus 1 target variable.
## Observations: 41,188
## Variables: 21
## $ age <int> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25,...
## $ job <fct> housemaid, services, services, admin., services...
## $ marital <fct> married, married, married, married, married, ma...
## $ education <fct> basic.4y, high.school, high.school, basic.6y, h...
## $ default <fct> no, unknown, no, no, no, unknown, no, unknown, ...
## $ housing <fct> no, no, yes, no, no, no, no, no, yes, yes, no, ...
## $ loan <fct> no, no, no, no, yes, no, no, no, no, no, no, no...
## $ contact <fct> telephone, telephone, telephone, telephone, tel...
## $ month <fct> may, may, may, may, may, may, may, may, may, ma...
## $ day_of_week <fct> mon, mon, mon, mon, mon, mon, mon, mon, mon, mo...
## $ duration <int> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50...
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ pdays <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 99...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ poutcome <fct> nonexistent, nonexistent, nonexistent, nonexist...
## $ emp.var.rate <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1....
## $ cons.price.idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994,...
## $ cons.conf.idx <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4...
## $ euribor3m <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857...
## $ nr.employed <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191,...
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no,...
Information about the variables, cited from the source:
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: “admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown”)
3 - marital : marital status (categorical: “divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced or widowed)
4 - education (categorical: “basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown”)
5 - default: has credit in default? (categorical: “no”,“yes”,“unknown”)
6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)
7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: “cellular”,“telephone”)
9 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
10 - day_of_week: last contact day of the week (categorical: “mon”,“tue”,“wed”,“thu”,“fri”)
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”)
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: “yes”,“no”)
options(scipen=999)
summary(bank)
## age job marital
## Min. :17.00 admin. :10422 divorced: 4612
## 1st Qu.:32.00 blue-collar: 9254 married :24928
## Median :38.00 technician : 6743 single :11568
## Mean :40.02 services : 3969 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1720
## (Other) : 6156
## education default housing
## university.degree :12168 no :32588 no :18622
## high.school : 9515 unknown: 8597 unknown: 990
## basic.9y : 6045 yes : 3 yes :21576
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## loan contact month day_of_week
## no :33950 cellular :26144 may :13769 fri:7827
## unknown: 990 telephone:15044 jul : 7174 mon:8514
## yes : 6248 aug : 6178 thu:8623
## jun : 5318 tue:8090
## nov : 4101 wed:8134
## apr : 2632
## (Other): 2016
## duration campaign pdays previous
## Min. : 0.0 Min. : 1.000 Min. : 0.0 Min. :0.000
## 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000
## Median : 180.0 Median : 2.000 Median :999.0 Median :0.000
## Mean : 258.3 Mean : 2.568 Mean :962.5 Mean :0.173
## 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :4918.0 Max. :56.000 Max. :999.0 Max. :7.000
##
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## failure : 4252 Min. :-3.40000 Min. :92.20 Min. :-50.8
## nonexistent:35563 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## success : 1373 Median : 1.10000 Median :93.75 Median :-41.8
## Mean : 0.08189 Mean :93.58 Mean :-40.5
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. : 1.40000 Max. :94.77 Max. :-26.9
##
## euribor3m nr.employed y
## Min. :0.634 Min. :4964 no :36548
## 1st Qu.:1.344 1st Qu.:5099 yes: 4640
## Median :4.857 Median :5191
## Mean :3.621 Mean :5167
## 3rd Qu.:4.961 3rd Qu.:5228
## Max. :5.045 Max. :5228
##
As noted by the dataset's author, “unknown” in the categorical columns denotes a missing value. Here is an example from the marital column:
table(bank$marital)
##
## divorced married single unknown
## 4612 24928 11568 80
bank[bank=='unknown'] <- NA
#missing value check
sum(is.na(bank))
## [1] 12718
The number of missing values is considerable and could have a negative impact.
md.pattern(bank) #library mice 3.3.0
## age contact month day_of_week duration campaign pdays previous
## 30488 1 1 1 1 1 1 1 1
## 7757 1 1 1 1 1 1 1 1
## 1102 1 1 1 1 1 1 1 1
## 456 1 1 1 1 1 1 1 1
## 739 1 1 1 1 1 1 1 1
## 207 1 1 1 1 1 1 1 1
## 22 1 1 1 1 1 1 1 1
## 16 1 1 1 1 1 1 1 1
## 116 1 1 1 1 1 1 1 1
## 76 1 1 1 1 1 1 1 1
## 54 1 1 1 1 1 1 1 1
## 70 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 59 1 1 1 1 1 1 1 1
## 7 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0
## poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m
## 30488 1 1 1 1 1
## 7757 1 1 1 1 1
## 1102 1 1 1 1 1
## 456 1 1 1 1 1
## 739 1 1 1 1 1
## 207 1 1 1 1 1
## 22 1 1 1 1 1
## 16 1 1 1 1 1
## 116 1 1 1 1 1
## 76 1 1 1 1 1
## 54 1 1 1 1 1
## 70 1 1 1 1 1
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 2 1 1 1 1 1
## 59 1 1 1 1 1
## 7 1 1 1 1 1
## 2 1 1 1 1 1
## 2 1 1 1 1 1
## 1 1 1 1 1 1
## 4 1 1 1 1 1
## 3 1 1 1 1 1
## 2 1 1 1 1 1
## 0 0 0 0 0
## nr.employed y marital job housing loan education default
## 30488 1 1 1 1 1 1 1 1 0
## 7757 1 1 1 1 1 1 1 0 1
## 1102 1 1 1 1 1 1 0 1 1
## 456 1 1 1 1 1 1 0 0 2
## 739 1 1 1 1 0 0 1 1 2
## 207 1 1 1 1 0 0 1 0 3
## 22 1 1 1 1 0 0 0 1 3
## 16 1 1 1 1 0 0 0 0 4
## 116 1 1 1 0 1 1 1 1 1
## 76 1 1 1 0 1 1 1 0 2
## 54 1 1 1 0 1 1 0 1 2
## 70 1 1 1 0 1 1 0 0 3
## 1 1 1 1 0 0 0 1 1 3
## 2 1 1 1 0 0 0 1 0 4
## 2 1 1 1 0 0 0 0 0 5
## 59 1 1 0 1 1 1 1 1 1
## 7 1 1 0 1 1 1 1 0 2
## 2 1 1 0 1 1 1 0 1 2
## 2 1 1 0 1 1 1 0 0 3
## 1 1 1 0 1 0 0 1 1 3
## 4 1 1 0 0 1 1 1 1 2
## 3 1 1 0 0 1 1 0 1 3
## 2 1 1 0 0 1 1 0 0 4
## 0 0 80 330 990 990 1731 8597 12718
#library mice 3.5.0: md.pattern(bank, rotate.names = TRUE)
aggr(bank,
numbers = TRUE,
prop = FALSE,
sortVars = TRUE,
cex.axis = .5,
gap = 2,
ylab = c("Number of misisngs", "Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## default 8597
## education 1731
## housing 990
## loan 990
## job 330
## marital 80
## age 0
## contact 0
## month 0
## day_of_week 0
## duration 0
## campaign 0
## pdays 0
## previous 0
## poutcome 0
## emp.var.rate 0
## cons.price.idx 0
## cons.conf.idx 0
## euribor3m 0
## nr.employed 0
## y 0
Even though it is hard to tell at a glance which columns contain missing values, the plot shows they are spread across several columns. Given their large number, simply removing the affected observations would likely unbalance the data and have a significant impact on the prediction.
A popular method for categorical values is mode imputation: assigning missing values to the most frequent level. Base R has no built-in function for this, but a GitHub gist provides a suitable solution (https://gist.github.com/jmarhee/8530768).
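For reference, a minimal sketch of what such a Mode() helper might look like (the actual Mode.R sourced below, taken from the gist, may differ):
# Hypothetical Mode() helper: returns the most frequent value of a vector;
# the Mode.R file sourced below (from the gist) may be implemented differently.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# Example: most frequent level of marital, ignoring the missing values
Mode(na.omit(bank$marital))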
source("Mode.R")
bank <- bank %>% mutate_if(is.factor,
funs(replace(.,is.na(.), Mode(na.omit(.))))) %>%
mutate_if(is.factor, factor)
sum(is.na(bank))
## [1] 0
A model with many features can overfit and perform poorly on test data; it also increases time and memory complexity unnecessarily. Best practice is to identify the least important features through tests and analysis, then remove them from the final formula.
The basic idea is to use the chi-squared test to rank the categorical features from most to least important.
cate_vars <- c('job','marital','education','default','housing','poutcome',
'loan','contact','month','day_of_week')
chi2_weights <- chi.squared(y ~ ., bank[,c("y", cate_vars)])
chi2_weights
## attr_importance
## job 0.152688727
## marital 0.054152670
## education 0.068309967
## default 0.003041013
## housing 0.011085158
## poutcome 0.320487960
## loan 0.004466117
## contact 0.144773056
## month 0.274394871
## day_of_week 0.025194658
In many cases, a column with zero or near-zero variance offers no important contribution to the model.
num_vars <- c('age','duration','campaign','pdays','previous','emp.var.rate',
'cons.price.idx','cons.conf.idx', 'euribor3m','nr.employed')
nearZeroVar(bank,
saveMetrics = TRUE)
## freqRatio percentUnique zeroVar nzv
## age 1.054713 0.189375546 FALSE FALSE
## job 1.161876 0.026706808 FALSE FALSE
## marital 2.161826 0.007283675 FALSE FALSE
## education 1.460746 0.016995241 FALSE FALSE
## default 13728.333333 0.004855783 FALSE TRUE
## housing 1.211793 0.004855783 FALSE FALSE
## loan 5.592190 0.004855783 FALSE FALSE
## contact 1.737836 0.004855783 FALSE FALSE
## month 1.919292 0.024278916 FALSE FALSE
## day_of_week 1.012802 0.012139458 FALSE FALSE
## duration 1.000000 3.748664660 FALSE FALSE
## campaign 1.669063 0.101971448 FALSE FALSE
## pdays 90.371298 0.065553074 FALSE TRUE
## previous 7.797194 0.019423133 FALSE FALSE
## poutcome 8.363829 0.007283675 FALSE FALSE
## emp.var.rate 1.767639 0.024278916 FALSE FALSE
## cons.price.idx 1.161257 0.063125182 FALSE FALSE
## cons.conf.idx 1.161257 0.063125182 FALSE FALSE
## euribor3m 1.097589 0.767213752 FALSE FALSE
## nr.employed 1.902273 0.026706808 FALSE FALSE
## y 7.876724 0.004855783 FALSE FALSE
From the table, default and pdays appear to have near-zero variance. Before they can be removed, however, the proportion of zero values should be checked.
prop.table(table(bank$default == 0))
##
## FALSE
## 1
As it turns out, default contains no zero values at all, so it is better left alone.
prop.table(table(bank$pdays == 0))
##
## FALSE TRUE
## 0.9996358163 0.0003641837
Once again, nothing will change.
For numeric variables, strong similarity between two columns means redundant information for the model. This can be detected by checking correlations.
(corrs <- cor(bank[, num_vars]))
## age duration campaign pdays
## age 1.0000000000 -0.000865705 0.00459358 -0.03436895
## duration -0.0008657050 1.000000000 -0.07169923 -0.04757702
## campaign 0.0045935805 -0.071699226 1.00000000 0.05258357
## pdays -0.0343689512 -0.047577015 0.05258357 1.00000000
## previous 0.0243647409 0.020640351 -0.07914147 -0.58751386
## emp.var.rate -0.0003706855 -0.027967884 0.15075381 0.27100417
## cons.price.idx 0.0008567150 0.005312268 0.12783591 0.07888911
## cons.conf.idx 0.1293716142 -0.008172873 -0.01373310 -0.09134235
## euribor3m 0.0107674295 -0.032896656 0.13513251 0.29689911
## nr.employed -0.0177251319 -0.044703223 0.14409489 0.37260474
## previous emp.var.rate cons.price.idx cons.conf.idx
## age 0.02436474 -0.0003706855 0.000856715 0.129371614
## duration 0.02064035 -0.0279678845 0.005312268 -0.008172873
## campaign -0.07914147 0.1507538056 0.127835912 -0.013733099
## pdays -0.58751386 0.2710041743 0.078889109 -0.091342354
## previous 1.00000000 -0.4204891094 -0.203129967 -0.050936351
## emp.var.rate -0.42048911 1.0000000000 0.775334171 0.196041268
## cons.price.idx -0.20312997 0.7753341708 1.000000000 0.058986182
## cons.conf.idx -0.05093635 0.1960412681 0.058986182 1.000000000
## euribor3m -0.45449365 0.9722446712 0.688230107 0.277686220
## nr.employed -0.50133293 0.9069701013 0.522033977 0.100513432
## euribor3m nr.employed
## age 0.01076743 -0.01772513
## duration -0.03289666 -0.04470322
## campaign 0.13513251 0.14409489
## pdays 0.29689911 0.37260474
## previous -0.45449365 -0.50133293
## emp.var.rate 0.97224467 0.90697010
## cons.price.idx 0.68823011 0.52203398
## cons.conf.idx 0.27768622 0.10051343
## euribor3m 1.00000000 0.94515443
## nr.employed 0.94515443 1.00000000
corrplot.mixed(corrs,
upper = "pie",
lower = "number")
findCorrelation(corrs,
cutoff = 0.9,
names = TRUE)
## [1] "euribor3m" "emp.var.rate"
Using a 90% correlation cutoff, emp.var.rate and euribor3m can be filtered out.
The dataset has 10 quantitative and 10 qualitative variables. To balance against the 8 quantitative variables kept in the formula, the 8 most important qualitative variables will be selected.
selected_cate<-cutoff.k(chi2_weights, k = 8)
selected_cate
## [1] "poutcome" "month" "job" "contact" "education"
## [6] "marital" "day_of_week" "housing"
The final formula is written as:
selected_num <- c('age','duration','campaign','pdays','previous',
'cons.price.idx','cons.conf.idx', 'nr.employed')
groupvars <- paste(paste(selected_cate,collapse = " + "),
paste(selected_num, collapse = " + "), sep = ' + ')
model_formula <- as.formula(paste('y',groupvars, sep = ' ~ '))
model_formula
## y ~ poutcome + month + job + contact + education + marital +
## day_of_week + housing + age + duration + campaign + pdays +
## previous + cons.price.idx + cons.conf.idx + nr.employed
set.seed(123)
which_train <- createDataPartition(bank$y, p = 0.8, list = FALSE)
bank_down <- bank[which_train,]
bank_test <- bank[-which_train,]
tabyl(bank_down$y)
## bank_down$y n percent
## no 29239 0.8873479
## yes 3712 0.1126521
tabyl(bank_test$y)
## bank_test$y n percent
## no 7309 0.8873376
## yes 928 0.1126624
createDataPartition() splits the data into a train set and a test set while preserving the distribution of the target variable, as the two tables above show.
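As a quick check, the class proportions can be compared side by side across the full data and both splits (an illustrative snippet using the objects created above):
# Compare the target distribution in the full data, the train set and the
# test set; stratified sampling should keep the proportions almost identical.
rbind(full  = prop.table(table(bank$y)),
      train = prop.table(table(bank_down$y)),
      test  = prop.table(table(bank_test$y)))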
Despite its name, logistic “regression” is used here as a classification model: the output is framed as a probability via the logistic function. The probability lies between 0 and 1 and can be thresholded by the user into classes such as ‘yes’/‘no’ or ‘true’/‘false’.
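To illustrate the idea, the logistic (sigmoid) function maps the linear predictor to a probability, which is then thresholded into classes (a small illustrative sketch, not part of the modelling pipeline):
# Logistic (sigmoid) function: maps any real-valued linear predictor eta
# into a probability between 0 and 1.
sigmoid <- function(eta) 1 / (1 + exp(-eta))

# Example: convert a few linear-predictor values into probabilities,
# then threshold at 0.5 to obtain class labels.
eta <- c(-2, 0, 1.5)
prob <- sigmoid(eta)
ifelse(prob > 0.5, "yes", "no")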
set.seed(123)
ctrl_cv5x3a <- trainControl(method = "repeatedcv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
repeats = 3)
logit_train <-
train(model_formula,
data = bank_down,
method = "glm",
metric = "ROC",
family = "binomial",
trControl = ctrl_cv5x3a)
# Summary
summary(logit_train)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.9225 -0.3073 -0.1936 -0.1374 3.1319
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 90.42120159 4.67683882 19.334
## poutcomenonexistent 0.44680653 0.10486497 4.261
## poutcomesuccess 0.93648913 0.23422194 3.998
## monthaug 0.09604061 0.11153558 0.861
## monthdec -0.09745703 0.21918386 -0.445
## monthjul 0.23498821 0.10441297 2.251
## monthjun 0.56180937 0.10035802 5.598
## monthmar 1.22705220 0.13187509 9.305
## monthmay -0.77534066 0.08257615 -9.389
## monthnov -0.29955387 0.10667950 -2.808
## monthoct 0.02920495 0.13708981 0.213
## monthsep -0.49268620 0.14710397 -3.349
## `jobblue-collar` -0.24529425 0.08653543 -2.835
## jobentrepreneur -0.22684554 0.14007322 -1.619
## jobhousemaid -0.00356211 0.16051496 -0.022
## jobmanagement -0.05197012 0.09489854 -0.548
## jobretired 0.31049585 0.11771733 2.638
## `jobself-employed` -0.15042989 0.13203698 -1.139
## jobservices -0.10468080 0.09324357 -1.123
## jobstudent 0.24755407 0.11997074 2.063
## jobtechnician 0.00284645 0.07806458 0.036
## jobunemployed -0.00220706 0.14118705 -0.016
## contacttelephone -0.36551968 0.07779272 -4.699
## educationbasic.6y 0.11866552 0.13338089 0.890
## educationbasic.9y 0.06162253 0.10629834 0.580
## educationhigh.school 0.11918026 0.10182273 1.170
## educationilliterate 1.18742473 0.77076037 1.541
## educationprofessional.course 0.14609463 0.11294514 1.294
## educationuniversity.degree 0.26930033 0.09902771 2.719
## maritalmarried 0.03475556 0.07615878 0.456
## maritalsingle 0.09857749 0.08690029 1.134
## day_of_weekmon -0.12767818 0.07283839 -1.753
## day_of_weekthu -0.00581714 0.07076536 -0.082
## day_of_weektue 0.08302712 0.07235523 1.147
## day_of_weekwed 0.15018716 0.07257820 2.069
## housingyes -0.03630511 0.04533513 -0.801
## age 0.00095468 0.00265923 0.359
## duration 0.00457544 0.00008158 56.086
## campaign -0.03598209 0.01285511 -2.799
## pdays -0.00094231 0.00024012 -3.924
## previous -0.03209143 0.06709048 -0.478
## cons.price.idx -0.24964898 0.04820186 -5.179
## cons.conf.idx 0.01268636 0.00576518 2.201
## nr.employed -0.01356947 0.00039554 -34.306
## Pr(>|z|)
## (Intercept) < 0.0000000000000002 ***
## poutcomenonexistent 0.0000203715 ***
## poutcomesuccess 0.0000637996 ***
## monthaug 0.38920
## monthdec 0.65658
## monthjul 0.02441 *
## monthjun 0.0000000217 ***
## monthmar < 0.0000000000000002 ***
## monthmay < 0.0000000000000002 ***
## monthnov 0.00499 **
## monthoct 0.83130
## monthsep 0.00081 ***
## `jobblue-collar` 0.00459 **
## jobentrepreneur 0.10534
## jobhousemaid 0.98229
## jobmanagement 0.58394
## jobretired 0.00835 **
## `jobself-employed` 0.25458
## jobservices 0.26158
## jobstudent 0.03907 *
## jobtechnician 0.97091
## jobunemployed 0.98753
## contacttelephone 0.0000026190 ***
## educationbasic.6y 0.37364
## educationbasic.9y 0.56211
## educationhigh.school 0.24181
## educationilliterate 0.12342
## educationprofessional.course 0.19584
## educationuniversity.degree 0.00654 **
## maritalmarried 0.64813
## maritalsingle 0.25664
## day_of_weekmon 0.07962 .
## day_of_weekthu 0.93449
## day_of_weektue 0.25118
## day_of_weekwed 0.03852 *
## housingyes 0.42324
## age 0.71959
## duration < 0.0000000000000002 ***
## campaign 0.00513 **
## pdays 0.0000869844 ***
## previous 0.63241
## cons.price.idx 0.0000002228 ***
## cons.conf.idx 0.02777 *
## nr.employed < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23199 on 32950 degrees of freedom
## Residual deviance: 13916 on 32907 degrees of freedom
## AIC: 14004
##
## Number of Fisher Scoring iterations: 6
Fitted values
logit_train_fitted <- predict(logit_train,
bank_down,
type = "prob")
Predicted values
logit_train_forecasts <- predict(logit_train,
bank_test,
type = "prob")
confusion matrix train set
confusionMatrix(data = as.factor(ifelse(logit_train_fitted["yes"] > 0.5,
"yes",
"no")),
reference = bank_down$y,
positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 28464 2204
## yes 775 1508
##
## Accuracy : 0.9096
## 95% CI : (0.9064, 0.9127)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4564
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.40625
## Specificity : 0.97349
## Pos Pred Value : 0.66053
## Neg Pred Value : 0.92813
## Prevalence : 0.11265
## Detection Rate : 0.04576
## Detection Prevalence : 0.06928
## Balanced Accuracy : 0.68987
##
## 'Positive' Class : yes
##
Accuracy: 0.9096 is a good result. However, Balanced Accuracy: 0.68987 is rather disappointing.
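Balanced Accuracy is simply the mean of Sensitivity and Specificity: (0.40625 + 0.97349) / 2 ≈ 0.690, so it is the low sensitivity on the minority “yes” class that drags it down despite the high overall accuracy.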
ROC train set
roc.area(ifelse(bank_down$y == "yes", 1, 0),
logit_train_fitted[,"yes"])
## $A
## [1] 0.9323093
##
## $n.total
## [1] 32951
##
## $n.events
## [1] 3712
##
## $n.noevents
## [1] 29239
##
## $p.value
## [1] 0
confusion matrix bank_test set
cMatrix<-confusionMatrix(data = as.factor(ifelse(logit_train_forecasts["yes"] > 0.5,
"yes",
"no")),
reference = bank_test$y,
positive = "yes")
cMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7116 520
## yes 193 408
##
## Accuracy : 0.9134
## 95% CI : (0.9072, 0.9194)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.000000000000004566
##
## Kappa : 0.4884
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.43966
## Specificity : 0.97359
## Pos Pred Value : 0.67887
## Neg Pred Value : 0.93190
## Prevalence : 0.11266
## Detection Rate : 0.04953
## Detection Prevalence : 0.07296
## Balanced Accuracy : 0.70662
##
## 'Positive' Class : yes
##
Accuracy: 0.9134 is good, and Balanced Accuracy: 0.70662 is better than on the train set.
ROC bank_test set
roc<-roc.area(ifelse(bank_test$y == "yes", 1, 0),
logit_train_forecasts[,"yes"])
roc
## $A
## [1] 0.9359121
##
## $n.total
## [1] 8237
##
## $n.events
## [1] 928
##
## $n.noevents
## [1] 7309
##
## $p.value
## [1] 0
ROC bank_test set plot
roc.plot(ifelse(bank_test$y == "yes", 1, 0),
logit_train_forecasts[,"yes"])
## Warning in roc.plot.default(ifelse(bank_test$y == "yes", 1, 0),
## logit_train_forecasts[, : Large amount of unique predictions used as
## thresholds. Consider specifying thresholds.
Linear discriminant analysis is a generalization of Fisher’s linear discriminant: it finds a linear combination of features and applies Bayes’ theorem for classification. It can be more stable than logistic regression.
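Under the hood, caret’s method = "lda" relies on MASS::lda(); a minimal direct call might look like the sketch below (illustrative only — the caret-based training that follows is what this report actually uses):
library(MASS)

# Fit LDA directly with the same formula and training data
lda_direct <- lda(model_formula, data = bank_down)

# predict() returns $class (predicted labels) and $posterior (class probabilities)
lda_pred <- predict(lda_direct, newdata = bank_test)
head(lda_pred$posterior)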
set.seed(123)
ctrl_cv5x3a <- trainControl(method = "repeatedcv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
repeats = 3)
lda_train <-
train(model_formula,
data = bank_down,
method = "lda",
metric = "ROC",
trControl = ctrl_cv5x3a)
# Summary
summary(lda_train)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 86 -none- numeric
## scaling 43 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## xNames 43 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
Fitted values
lda_train_fitted <- predict(lda_train,
bank_down,
type = "prob")
Predicted values
lda_train_forecasts <- predict(lda_train,
bank_test,
type = "prob")
confusion matrix train set
confusionMatrix(data = as.factor(ifelse(lda_train_fitted["yes"] > 0.5,
"yes",
"no")),
reference = bank_down$y,
positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 28083 1848
## yes 1156 1864
##
## Accuracy : 0.9088
## 95% CI : (0.9057, 0.9119)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5036
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.50216
## Specificity : 0.96046
## Pos Pred Value : 0.61722
## Neg Pred Value : 0.93826
## Prevalence : 0.11265
## Detection Rate : 0.05657
## Detection Prevalence : 0.09165
## Balanced Accuracy : 0.73131
##
## 'Positive' Class : yes
##
Accuracy: 0.9088 and Balanced Accuracy: 0.73131 indicate a good result.
ROC train set
roc.area(ifelse(bank_down$y == "yes", 1, 0),
lda_train_fitted[,"yes"])
## $A
## [1] 0.9315323
##
## $n.total
## [1] 32951
##
## $n.events
## [1] 3712
##
## $n.noevents
## [1] 29239
##
## $p.value
## [1] 0
confusion matrix bank_test set
cMatrix<-confusionMatrix(data = as.factor(ifelse(lda_train_forecasts["yes"] > 0.5,
"yes",
"no")),
reference = bank_test$y,
positive = "yes")
cMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7008 441
## yes 301 487
##
## Accuracy : 0.9099
## 95% CI : (0.9035, 0.916)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.00000000001313
##
## Kappa : 0.5177
## Mcnemar's Test P-Value : 0.00000033457525
##
## Sensitivity : 0.52478
## Specificity : 0.95882
## Pos Pred Value : 0.61802
## Neg Pred Value : 0.94080
## Prevalence : 0.11266
## Detection Rate : 0.05912
## Detection Prevalence : 0.09567
## Balanced Accuracy : 0.74180
##
## 'Positive' Class : yes
##
Accuracy: 0.9099, while Balanced Accuracy improves to 0.74180.
ROC bank_test set
roc<-roc.area(ifelse(bank_test$y == "yes", 1, 0),
lda_train_forecasts[,"yes"])
roc
## $A
## [1] 0.9346318
##
## $n.total
## [1] 8237
##
## $n.events
## [1] 928
##
## $n.noevents
## [1] 7309
##
## $p.value
## [1] 0
ROC bank_test set plot
roc.plot(ifelse(bank_test$y == "yes", 1, 0),
lda_train_forecasts[,"yes"])
## Warning in roc.plot.default(ifelse(bank_test$y == "yes", 1, 0),
## lda_train_forecasts[, : Large amount of unique predictions used as
## thresholds. Consider specifying thresholds.
When the data contain a significant number of variables, it can be useful to find a reduced subset of variables so that performance stays optimal.
In such cases, imposing a penalty on the logistic regression pushes the coefficients of variables that contribute less toward zero. This method is widely known as regularization.
There are three well-known regularization penalties:
Lasso regression: penalizes the sum of absolute coefficient values (L1 penalty)
Ridge regression: penalizes the sum of squared coefficients (L2 penalty)
Elastic net: a combination of both the lasso and ridge penalties
In this report, the elastic net (labelled “Net Elastic” in the comparison table) was used, even though under some circumstances a single penalty may perform better.
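The mixing parameter alpha controls the blend (alpha = 1 is the pure lasso penalty, alpha = 0 pure ridge). Below is a hedged sketch of how a few values of alpha could be compared by cross-validated deviance; the report itself simply fixes alpha = 0.5 in the code that follows:
library(glmnet)

# Build the design matrix and 0/1 outcome (same transformation as below)
x <- model.matrix(y ~ ., bank_down)[, -1]
y <- ifelse(bank_down$y == "yes", 1, 0)

# Cross-validate a few candidate alphas and compare the minimum CV deviance
set.seed(123)
alphas <- c(0, 0.25, 0.5, 0.75, 1)
cv_err <- sapply(alphas, function(a) {
  min(cv.glmnet(x, y, alpha = a, family = "binomial")$cvm)
})
data.frame(alpha = alphas, min_cv_deviance = cv_err)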
Unlike the other models, the regularization functions accept only a matrix as input rather than a raw data frame.
# Dumy code categorical predictor variables
x <- model.matrix(y~., bank_down)[,-1]
# Convert the outcome (class) to a numerical variable
y <- ifelse(bank_down$y == "yes", 1, 0)
Find the best lambda using cross-validation
cv.net <- cv.glmnet(x, y, alpha = 0.5, family = "binomial")
Display regression coefficients
coef(cv.net)
## 48 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 41.3011799632
## age .
## jobblue-collar -0.1556873448
## jobentrepreneur .
## jobhousemaid .
## jobmanagement .
## jobretired 0.2136182389
## jobself-employed .
## jobservices .
## jobstudent 0.2027000279
## jobtechnician .
## jobunemployed .
## maritalmarried .
## maritalsingle 0.0239695860
## educationbasic.6y .
## educationbasic.9y .
## educationhigh.school .
## educationilliterate .
## educationprofessional.course .
## educationuniversity.degree 0.0962649229
## defaultyes .
## housingyes .
## loanyes .
## contacttelephone -0.1950296434
## monthaug .
## monthdec .
## monthjul .
## monthjun 0.1440108907
## monthmar 1.1151635008
## monthmay -0.7413631225
## monthnov -0.1916883269
## monthoct 0.0776097416
## monthsep -0.0592038072
## day_of_weekmon -0.0531327440
## day_of_weekthu .
## day_of_weektue .
## day_of_weekwed 0.0169268253
## duration 0.0040981504
## campaign -0.0002195744
## pdays -0.0007958577
## previous .
## poutcomenonexistent 0.1793600384
## poutcomesuccess 0.8433424805
## emp.var.rate -0.1846557153
## cons.price.idx .
## cons.conf.idx 0.0115821734
## euribor3m -0.0247689761
## nr.employed -0.0084785511
Shrinking the coefficients does not necessarily mean the number of variables is reduced; only some coefficients (shown as ‘.’) are driven exactly to zero.
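A quick way to see how many coefficients the penalty actually drives to zero (an illustrative snippet; coef() with no s argument uses lambda.1se, which is what is printed above):
# Count the non-zero coefficients at the default lambda ("lambda.1se",
# as printed above) and at lambda.min for comparison.
cf_1se <- as.vector(coef(cv.net))
cf_min <- as.vector(coef(cv.net, s = "lambda.min"))
c(nonzero_at_1se = sum(cf_1se != 0),
  nonzero_at_min = sum(cf_min != 0),
  total_coefficients = length(cf_1se))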
Fit the final model on the training data
net_train <- glmnet(x, y, alpha = 0.5, family = "binomial",
lambda = cv.net$lambda.min)
Make predictions on the test data
x_test <- model.matrix(y ~., bank_test)[,-1]
net_forecasts <- cv.net %>% predict(newx = x_test)
Confusion matrix bank_test set
cMatrix<-confusionMatrix(data = as.factor(ifelse(net_forecasts > 0.5,"yes", "no")),
reference = bank_test$y,
positive = "yes")
cMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7201 656
## yes 108 272
##
## Accuracy : 0.9072
## 95% CI : (0.9008, 0.9134)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.000000002435
##
## Kappa : 0.375
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.29310
## Specificity : 0.98522
## Pos Pred Value : 0.71579
## Neg Pred Value : 0.91651
## Prevalence : 0.11266
## Detection Rate : 0.03302
## Detection Prevalence : 0.04613
## Balanced Accuracy : 0.63916
##
## 'Positive' Class : yes
##
ROC bank_test set
roc<-roc.area(ifelse(bank_test$y == "yes", 1, 0),
ifelse(net_forecasts > 0.5, 1, 0))
roc
## $A
## [1] 0.6391636
##
## $n.total
## [1] 8237
##
## $n.events
## [1] 928
##
## $n.noevents
## [1] 7309
##
## $p.value
## [1] 0
ROC plot bank_test set
roc.plot(ifelse(bank_test$y == "yes", 1, 0),
net_forecasts)
Instead of using a custom formula, there is a method that lets the model add/remove features automatically until the best fit is reached.
There are three variants of this strategy: forward selection, backward elimination, and bidirectional (stepwise in both directions).
The downside of this method is that it takes a long time to train. To reduce the computation, only 10% of the training data is used, and backward stepwise selection serves as the example of the three.
set.seed(123)
which_small <- createDataPartition(bank_down$y,
p = 0.1,
list = FALSE)
bank_down_small <- bank_down[which_small,]
ctrl_nocv <- trainControl(method = "none")
logit_backward_train <-
train(y~.,
data = bank_down_small,
# stepwise method
method = "glmStepAIC",
# additional argument
direction = "backward",
trControl = ctrl_nocv)
# Summary
summary(logit_backward_train)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7419 -0.2951 -0.1807 -0.1264 2.9967
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -175.8538795 20.4705367 -8.591 < 0.0000000000000002
## jobentrepreneur -2.2981321 0.7806478 -2.944 0.003241
## jobunemployed 0.7386721 0.4255651 1.736 0.082609
## maritalmarried 0.6584864 0.2887554 2.280 0.022582
## maritalsingle 0.7550437 0.2977289 2.536 0.011212
## contacttelephone -0.8756656 0.2394056 -3.658 0.000255
## monthaug 0.5696218 0.2582666 2.206 0.027415
## monthmar 2.2894933 0.3802869 6.020 0.00000000174
## monthmay -0.4862553 0.1948151 -2.496 0.012561
## day_of_weekmon -0.4459487 0.1895273 -2.353 0.018625
## duration 0.0047914 0.0002626 18.248 < 0.0000000000000002
## previous -0.1834359 0.1221841 -1.501 0.133276
## poutcomesuccess 1.7540073 0.3282482 5.344 0.00000009115
## emp.var.rate -1.0955977 0.0825343 -13.274 < 0.0000000000000002
## cons.price.idx 1.8558290 0.2197159 8.446 < 0.0000000000000002
## cons.conf.idx 0.0558012 0.0157176 3.550 0.000385
##
## (Intercept) ***
## jobentrepreneur **
## jobunemployed .
## maritalmarried *
## maritalsingle *
## contacttelephone ***
## monthaug *
## monthmar ***
## monthmay *
## day_of_weekmon *
## duration ***
## previous
## poutcomesuccess ***
## emp.var.rate ***
## cons.price.idx ***
## cons.conf.idx ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2323.4 on 3295 degrees of freedom
## Residual deviance: 1316.9 on 3280 degrees of freedom
## AIC: 1348.9
##
## Number of Fisher Scoring iterations: 7
Fitted values
logit_backward_train_fitted <- predict(logit_backward_train,
bank_down,
type = "prob")
Predicted values
logit_backward_train_forecasts <- predict(logit_backward_train,
bank_test,
type = "prob")
confusion matrix train set
confusionMatrix(data = as.factor(ifelse(logit_backward_train_fitted["yes"] > 0.5,
"yes",
"no")),
reference = bank_down$y,
positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 28330 2075
## yes 909 1637
##
## Accuracy : 0.9094
## 95% CI : (0.9063, 0.9125)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4751
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.44100
## Specificity : 0.96891
## Pos Pred Value : 0.64297
## Neg Pred Value : 0.93175
## Prevalence : 0.11265
## Detection Rate : 0.04968
## Detection Prevalence : 0.07727
## Balanced Accuracy : 0.70496
##
## 'Positive' Class : yes
##
ROC train set
roc.area(ifelse(bank_down$y == "yes", 1, 0),
logit_backward_train_fitted[,"yes"])
## $A
## [1] 0.9243167
##
## $n.total
## [1] 32951
##
## $n.events
## [1] 3712
##
## $n.noevents
## [1] 29239
##
## $p.value
## [1] 0
confusion matrix bank_test set
cMatrix<-confusionMatrix(data = as.factor(ifelse(logit_backward_train_forecasts["yes"] > 0.5,
"yes",
"no")),
reference = bank_test$y,
positive = "yes")
cMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7099 501
## yes 210 427
##
## Accuracy : 0.9137
## 95% CI : (0.9074, 0.9197)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.000000000000002517
##
## Kappa : 0.4998
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.46013
## Specificity : 0.97127
## Pos Pred Value : 0.67033
## Neg Pred Value : 0.93408
## Prevalence : 0.11266
## Detection Rate : 0.05184
## Detection Prevalence : 0.07733
## Balanced Accuracy : 0.71570
##
## 'Positive' Class : yes
##
ROC bank_test set
roc<-roc.area(ifelse(bank_test$y == "yes", 1, 0),
logit_backward_train_forecasts[,"yes"])
roc
## $A
## [1] 0.9249201
##
## $n.total
## [1] 8237
##
## $n.events
## [1] 928
##
## $n.noevents
## [1] 7309
##
## $p.value
## [1] 0
ROC bank_test set plot
roc.plot(ifelse(bank_test$y == "yes", 1, 0),
logit_backward_train_forecasts[,"yes"])
## Warning in roc.plot.default(ifelse(bank_test$y == "yes", 1, 0),
## logit_backward_train_forecasts[, : Large amount of unique predictions used
## as thresholds. Consider specifying thresholds.
On closer inspection, the target variable has a rather imbalanced response.
plot(bank$y)
Resampling the data can resolve this problem. This report employs simple downsampling, since the dataset is large and other methods require more complex modification. Afterwards, logistic regression is fitted once again to contrast the results with and without resampling.
bank_down <- downSample(# a matrix or data frame of predictor variables
x = bank,
# a factor variable with the class memberships
y = bank$y,
yname = 'y')
plot(bank_down$y)
set.seed(123)
ctrl_cv5x3a <- trainControl(method = "repeatedcv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
repeats = 3)
logit_train_down <-
train(model_formula,
data = bank_down,
method = "glm",
metric = "ROC",
family = "binomial",
trControl = ctrl_cv5x3a)
# Summary
summary(logit_train_down)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.7113 -0.3834 -0.0299 0.4696 2.8042
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 134.7905764 7.2883351 18.494
## poutcomenonexistent 0.5878466 0.1788857 3.286
## poutcomesuccess 1.1744008 0.4449877 2.639
## monthaug 0.0560733 0.1783100 0.314
## monthdec 0.6623160 0.5307674 1.248
## monthjul 0.1030927 0.1557615 0.662
## monthjun 0.2412178 0.1462258 1.650
## monthmar 1.0741070 0.2035046 5.278
## monthmay -1.2838507 0.1219229 -10.530
## monthnov -0.4482753 0.1588433 -2.822
## monthoct 0.8471415 0.2526704 3.353
## monthsep -0.6356747 0.2401612 -2.647
## `jobblue-collar` -0.2672402 0.1247763 -2.142
## jobentrepreneur -0.1778486 0.1905039 -0.934
## jobhousemaid -0.0725454 0.2398715 -0.302
## jobmanagement -0.3081139 0.1344372 -2.292
## jobretired 0.5974513 0.1807795 3.305
## `jobself-employed` 0.0071891 0.1924446 0.037
## jobservices -0.2369093 0.1364427 -1.736
## jobstudent 0.3540091 0.1924422 1.840
## jobtechnician -0.0298042 0.1159295 -0.257
## jobunemployed 0.1636437 0.2062688 0.793
## contacttelephone 0.1103048 0.1177006 0.937
## educationbasic.6y -0.2131415 0.1958554 -1.088
## educationbasic.9y -0.0887367 0.1524787 -0.582
## educationhigh.school 0.0675848 0.1502255 0.450
## educationilliterate 12.4887303 231.4142277 0.054
## educationprofessional.course 0.2302059 0.1644193 1.400
## educationuniversity.degree 0.3136710 0.1455642 2.155
## maritalmarried 0.0572722 0.1093055 0.524
## maritalsingle 0.1960455 0.1255927 1.561
## day_of_weekmon -0.2408250 0.1059709 -2.273
## day_of_weekthu -0.1380881 0.1072717 -1.287
## day_of_weektue -0.0662619 0.1080861 -0.613
## day_of_weekwed -0.0223141 0.1072929 -0.208
## housingyes 0.1116696 0.0666635 1.675
## age -0.0012453 0.0038664 -0.322
## duration 0.0070967 0.0001702 41.689
## campaign -0.0449017 0.0174160 -2.578
## pdays -0.0010449 0.0004482 -2.332
## previous 0.0282864 0.1302363 0.217
## cons.price.idx -0.5825783 0.0768454 -7.581
## cons.conf.idx -0.0073905 0.0094266 -0.784
## nr.employed -0.0160745 0.0006359 -25.277
## Pr(>|z|)
## (Intercept) < 0.0000000000000002 ***
## poutcomenonexistent 0.00102 **
## poutcomesuccess 0.00831 **
## monthaug 0.75316
## monthdec 0.21209
## monthjul 0.50806
## monthjun 0.09902 .
## monthmar 0.0000001305677283 ***
## monthmay < 0.0000000000000002 ***
## monthnov 0.00477 **
## monthoct 0.00080 ***
## monthsep 0.00812 **
## `jobblue-collar` 0.03221 *
## jobentrepreneur 0.35053
## jobhousemaid 0.76232
## jobmanagement 0.02191 *
## jobretired 0.00095 ***
## `jobself-employed` 0.97020
## jobservices 0.08251 .
## jobstudent 0.06583 .
## jobtechnician 0.79711
## jobunemployed 0.42757
## contacttelephone 0.34867
## educationbasic.6y 0.27648
## educationbasic.9y 0.56059
## educationhigh.school 0.65279
## educationilliterate 0.95696
## educationprofessional.course 0.16148
## educationuniversity.degree 0.03117 *
## maritalmarried 0.60030
## maritalsingle 0.11853
## day_of_weekmon 0.02305 *
## day_of_weekthu 0.19800
## day_of_weektue 0.53984
## day_of_weekwed 0.83525
## housingyes 0.09391 .
## age 0.74739
## duration < 0.0000000000000002 ***
## campaign 0.00993 **
## pdays 0.01972 *
## previous 0.82806
## cons.price.idx 0.0000000000000342 ***
## cons.conf.idx 0.43304
## nr.employed < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12864.8 on 9279 degrees of freedom
## Residual deviance: 6022.5 on 9236 degrees of freedom
## AIC: 6110.5
##
## Number of Fisher Scoring iterations: 12
Fitted values
logit_train_down_fitted <- predict(logit_train_down,
bank_down,
type = "prob")
Predicted values
logit_train_down_forecasts <- predict(logit_train_down,
bank_test,
type = "prob")
confusion matrix train set
confusionMatrix(data = as.factor(ifelse(logit_train_down_fitted["yes"] > 0.5,
"yes",
"no")),
reference = bank_down$y,
positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 3980 553
## yes 660 4087
##
## Accuracy : 0.8693
## 95% CI : (0.8623, 0.8761)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.7386
## Mcnemar's Test P-Value : 0.002338
##
## Sensitivity : 0.8808
## Specificity : 0.8578
## Pos Pred Value : 0.8610
## Neg Pred Value : 0.8780
## Prevalence : 0.5000
## Detection Rate : 0.4404
## Detection Prevalence : 0.5115
## Balanced Accuracy : 0.8693
##
## 'Positive' Class : yes
##
On the downsampled train set, Accuracy: 0.8693 and Balanced Accuracy: 0.8693 coincide; the overall accuracy is lower than before, but the balanced accuracy improves considerably.
ROC train set
roc.area(ifelse(bank_down$y == "yes", 1, 0),
logit_train_down_fitted[,"yes"])
## $A
## [1] 0.9385968
##
## $n.total
## [1] 9280
##
## $n.events
## [1] 4640
##
## $n.noevents
## [1] 4640
##
## $p.value
## [1] 0
confusion matrix bank_test set
cMatrix<-confusionMatrix(data = as.factor(ifelse(logit_train_down_forecasts["yes"] > 0.5,
"yes",
"no")),
reference = bank_test$y,
positive = "yes")
cMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6256 104
## yes 1053 824
##
## Accuracy : 0.8595
## 95% CI : (0.8518, 0.867)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.5143
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.8879
## Specificity : 0.8559
## Pos Pred Value : 0.4390
## Neg Pred Value : 0.9836
## Prevalence : 0.1127
## Detection Rate : 0.1000
## Detection Prevalence : 0.2279
## Balanced Accuracy : 0.8719
##
## 'Positive' Class : yes
##
On the test set, Accuracy: 0.8595 is lower than for the other models, but Balanced Accuracy: 0.8719 is the highest so far.
ROC bank_test set
roc<-roc.area(ifelse(bank_test$y == "yes", 1, 0),
logit_train_down_forecasts[,"yes"])
roc
## $A
## [1] 0.9391079
##
## $n.total
## [1] 8237
##
## $n.events
## [1] 928
##
## $n.noevents
## [1] 7309
##
## $p.value
## [1] 0
ROC bank_test set plot
roc.plot(ifelse(bank_test$y == "yes", 1, 0),
logit_train_forecasts[,"yes"])
## Warning in roc.plot.default(ifelse(bank_test$y == "yes", 1, 0),
## logit_train_forecasts[, : Large amount of unique predictions used as
## thresholds. Consider specifying thresholds.
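The comparison below assumes a list result in which each element bundles a model's test-set confusionMatrix with its roc.area() output (so that element [[7]] holds the ROC list), collected while the models were evaluated above. A hedged sketch of how it might have been assembled — the object names ending in _logit, _lda, _net, _backward and _down are illustrative only:
# Hypothetical assembly of `result`: each element concatenates a caret
# confusionMatrix object with the matching roc.area() list, so that
# result[[i]][[7]][["A"]] is that model's AUC.
result <- list(
  c(cMatrix_logit,    list(roc_logit)),
  c(cMatrix_lda,      list(roc_lda)),
  c(cMatrix_net,      list(roc_net)),
  c(cMatrix_backward, list(roc_backward)),
  c(cMatrix_down,     list(roc_down))
)

# Empty frame to hold the comparison metrics filled in by the loop below
result_df <- data.frame(Accuracy    = numeric(length(result)),
                        Balance_acc = numeric(length(result)),
                        Roc         = numeric(length(result)),
                        Sensitivity = numeric(length(result)),
                        Specificity = numeric(length(result)))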
# Comparison test result ----
for (i in seq(1,length(result),1)){
result_df[i,1] <- result[[i]][["overall"]][["Accuracy"]]
result_df[i,2] <- result[[i]][["byClass"]][["Balanced Accuracy"]]
result_df[i,3] <- result[[i]][[7]][["A"]]
result_df[i,4] <- result[[i]][["byClass"]][["Sensitivity"]]
result_df[i,5] <- result[[i]][["byClass"]][["Specificity"]]
}
row.names(result_df)<- c('Logistic Regression','Linear Discriminant',
'Net Elastic', 'Logit Backward', 'Logistic Regression (downsampling)')
result_df
## Accuracy Balance_acc Roc
## Logistic Regression 0.9134394 0.7066247 0.9359121
## Linear Discriminant 0.9099187 0.7418012 0.9346318
## Net Elastic 0.9072478 0.6391636 0.6391636
## Logit Backward 0.9136822 0.7156988 0.9249201
## Logistic Regression (downsampling) 0.8595362 0.8719310 0.9391079
## Sensitivity Specificity
## Logistic Regression 0.4396552 0.9735942
## Linear Discriminant 0.5247845 0.9588179
## Net Elastic 0.2931034 0.9852237
## Logit Backward 0.4601293 0.9712683
## Logistic Regression (downsampling) 0.8879310 0.8559310
Five models have been deployed to predict whether a client will subscribe to a term deposit. Each has its strengths and weaknesses, so it is hard to say which one is best; the best choice depends on the situation.
Logistic regression does the job well, with acceptable accuracy. Linear discriminant analysis, however, does not outshine the other models on any criterion. The elastic net has the best specificity but the worst sensitivity: it rarely misclassifies non-potential customers, yet it misses many potential ones, so if losing a potential customer is costly it should be avoided. Even though the backward stepwise logit is slow to run, it achieves the highest accuracy; still, being only marginally higher than plain logistic regression hardly justifies the extra cost. Logistic regression on the downsampled data sees its sensitivity shoot up at the cost of lower scores on the other criteria, except balanced accuracy, where it scores highest; depending on the business context, it is therefore worth considering.
Other models, such as kNN and SVM, were also considered, but with a large dataset and limited computing power they require a significant amount of time while the results do not improve much. Nevertheless, given the good results of these simple models, further modifications, whether to the data or to the models, could improve the results further.