Data and problem description

Targeting the right customers and finding those who are most likely to buy the product a company offers is one of the most important tasks for businesses that have already gathered some broad knowledge about their clients. In this paper we explore the data gathered by Portuguese researchers, publicly available at: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing. This dataset contains over 41 000 observations with variables describing the clients of a bank, data related to the previous and current campaigns held by the bank, and social and economic context attributes at a particular time. The main goal is to build a model that predicts the outcome of the campaign held by the bank, namely whether the client has subscribed to a term deposit.

Exploratory data analysis, data transformations and variable selection

Initially, let's summarize the data using built-in R functions.

# the file is semicolon-separated; note that read.csv2 also assumes ',' as the
# decimal mark, which is why several numeric columns are read in as factors
# (handled later)
bank <- read.csv2("bank-additional-full.csv")
summary(bank)
##       age                 job            marital     
##  Min.   :17.00   admin.     :10422   divorced: 4612  
##  1st Qu.:32.00   blue-collar: 9254   married :24928  
##  Median :38.00   technician : 6743   single  :11568  
##  Mean   :40.02   services   : 3969   unknown :   80  
##  3rd Qu.:47.00   management : 2924                   
##  Max.   :98.00   retired    : 1720                   
##                  (Other)    : 6156                   
##                education        default         housing           loan      
##  university.degree  :12168   no     :32588   no     :18622   no     :33950  
##  high.school        : 9515   unknown: 8597   unknown:  990   unknown:  990  
##  basic.9y           : 6045   yes    :    3   yes    :21576   yes    : 6248  
##  professional.course: 5243                                                  
##  basic.4y           : 4176                                                  
##  basic.6y           : 2292                                                  
##  (Other)            : 1749                                                  
##       contact          month       day_of_week    duration     
##  cellular :26144   may    :13769   fri:7827    Min.   :   0.0  
##  telephone:15044   jul    : 7174   mon:8514    1st Qu.: 102.0  
##                    aug    : 6178   thu:8623    Median : 180.0  
##                    jun    : 5318   tue:8090    Mean   : 258.3  
##                    nov    : 4101   wed:8134    3rd Qu.: 319.0  
##                    apr    : 2632               Max.   :4918.0  
##                    (Other): 2016                               
##     campaign          pdays          previous            poutcome    
##  Min.   : 1.000   Min.   :  0.0   Min.   :0.000   failure    : 4252  
##  1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000   nonexistent:35563  
##  Median : 2.000   Median :999.0   Median :0.000   success    : 1373  
##  Mean   : 2.568   Mean   :962.5   Mean   :0.173                      
##  3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000                      
##  Max.   :56.000   Max.   :999.0   Max.   :7.000                      
##                                                                      
##   emp.var.rate   cons.price.idx cons.conf.idx    euribor3m      nr.employed   
##  01.kwi :16234   93.994 :7763   -36.4  :7763   4.857  : 2868   5228.1 :16234  
##  -1.8   : 9184   93.918 :6685   -42.7  :6685   4.962  : 2613   5099.1 : 8534  
##  1.1    : 7763   92.893 :5794   -46.2  :5794   4.963  : 2487   5191   : 7763  
##  -0.1   : 3683   93.444 :5175   -36.1  :5175   4.961  : 1902   5195.8 : 3683  
##  -2.9   : 1663   94.465 :4374   -41.8  :4374   4.856  : 1210   5076.2 : 1663  
##  -3.4   : 1071   93.2   :3616   -42    :3616   4.964  : 1175   5017.5 : 1071  
##  (Other): 1590   (Other):7781   (Other):7781   (Other):28933   (Other): 2240  
##    y        
##  no :36548  
##  yes: 4640  
##             
##             
##             
##             
## 
str(bank)
## 'data.frame':    41188 obs. of  21 variables:
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
##  $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
##  $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
##  $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
##  $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
##  $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
##  $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
##  $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ emp.var.rate  : Factor w/ 10 levels "-0.1","-0.2",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ cons.price.idx: Factor w/ 26 levels "92.201","92.379",..: 19 19 19 19 19 19 19 19 19 19 ...
##  $ cons.conf.idx : Factor w/ 26 levels "-26.9","-29.8",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ euribor3m     : Factor w/ 315 levels "0.634","0.635",..: 277 277 277 277 277 277 277 277 277 277 ...
##  $ nr.employed   : Factor w/ 11 levels "4963.6","4991.6",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Data about customers

bank$y <- as.factor(ifelse(bank$y == 'yes', 1, 0))
prop.table(table(bank$y))
## 
##         0         1 
## 0.8873458 0.1126542

Age

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gmodels)

ggplot(bank, aes(x = age)) +
  geom_bar() +
  facet_grid(y ~ ., scales = 'free_y') +
  geom_vline(xintercept = c(30, 60), col = "blue",linetype = "dashed") +
  scale_x_continuous(breaks = seq(0, 100, 5))

bank = bank %>%
  mutate(age = ifelse(age > 60, 'old', ifelse(age > 30, 'middle', 'young')))

bank$age <- as.factor(bank$age)

We can distinguish 3 main patterns of how age influences the outcome variable. Young people (below 30 years old) and those reaching or already in retirement (over 60 years old) are more willing to take deposits than middle-aged people. Therefore we divide the continuous age variable into 3 categories.
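
As a side note, the same binning can be written with cut(); since bank$age has already been recoded to categories at this point, the sketch below applies the equivalent call to a few example ages only (the variable name is our own):

age_example <- cut(c(25, 45, 70),
                   breaks = c(-Inf, 30, 60, Inf),
                   labels = c("young", "middle", "old"))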

Job

CrossTable(bank$job, bank$y, prop.c = F, prop.t = F, prop.chisq = F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41188 
## 
##  
##               | bank$y 
##      bank$job |         0 |         1 | Row Total | 
## --------------|-----------|-----------|-----------|
##        admin. |      9070 |      1352 |     10422 | 
##               |     0.870 |     0.130 |     0.253 | 
## --------------|-----------|-----------|-----------|
##   blue-collar |      8616 |       638 |      9254 | 
##               |     0.931 |     0.069 |     0.225 | 
## --------------|-----------|-----------|-----------|
##  entrepreneur |      1332 |       124 |      1456 | 
##               |     0.915 |     0.085 |     0.035 | 
## --------------|-----------|-----------|-----------|
##     housemaid |       954 |       106 |      1060 | 
##               |     0.900 |     0.100 |     0.026 | 
## --------------|-----------|-----------|-----------|
##    management |      2596 |       328 |      2924 | 
##               |     0.888 |     0.112 |     0.071 | 
## --------------|-----------|-----------|-----------|
##       retired |      1286 |       434 |      1720 | 
##               |     0.748 |     0.252 |     0.042 | 
## --------------|-----------|-----------|-----------|
## self-employed |      1272 |       149 |      1421 | 
##               |     0.895 |     0.105 |     0.035 | 
## --------------|-----------|-----------|-----------|
##      services |      3646 |       323 |      3969 | 
##               |     0.919 |     0.081 |     0.096 | 
## --------------|-----------|-----------|-----------|
##       student |       600 |       275 |       875 | 
##               |     0.686 |     0.314 |     0.021 | 
## --------------|-----------|-----------|-----------|
##    technician |      6013 |       730 |      6743 | 
##               |     0.892 |     0.108 |     0.164 | 
## --------------|-----------|-----------|-----------|
##    unemployed |       870 |       144 |      1014 | 
##               |     0.858 |     0.142 |     0.025 | 
## --------------|-----------|-----------|-----------|
##       unknown |       293 |        37 |       330 | 
##               |     0.888 |     0.112 |     0.008 | 
## --------------|-----------|-----------|-----------|
##  Column Total |     36548 |      4640 |     41188 | 
## --------------|-----------|-----------|-----------|
## 
## 
bank = bank %>% 
  filter(job != "unknown")

Certain occupations have visibly greater proportions of the positive class than others. We remove the observations with unknown job values.

Marital

CrossTable(bank$marital, bank$y, prop.c = F, prop.t = F, prop.chisq = F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  40858 
## 
##  
##              | bank$y 
## bank$marital |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##     divorced |      4126 |       473 |      4599 | 
##              |     0.897 |     0.103 |     0.113 | 
## -------------|-----------|-----------|-----------|
##      married |     22178 |      2516 |     24694 | 
##              |     0.898 |     0.102 |     0.604 | 
## -------------|-----------|-----------|-----------|
##       single |      9889 |      1605 |     11494 | 
##              |     0.860 |     0.140 |     0.281 | 
## -------------|-----------|-----------|-----------|
##      unknown |        62 |         9 |        71 | 
##              |     0.873 |     0.127 |     0.002 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36255 |      4603 |     40858 | 
## -------------|-----------|-----------|-----------|
## 
## 
bank = bank %>% 
  filter(marital != "unknown")

Single people decide to subscribe to a term deposit more often than married or divorced ones. As before, we remove the observations with unknown values.

Education

table_educ <- table(bank$education, bank$y)
mosaicplot(table_educ, color = T)

nrow(bank[bank$education == 'illiterate',])
## [1] 18
bank = bank %>%
  filter(education != 'illiterate')

Since there are only 18 observations in the 'illiterate' category, we remove the observations with this level of education.

Default

summary(bank$default)
##      no unknown     yes 
##   32337    8429       3
table(bank$y, bank$default)
##    
##        no unknown   yes
##   0 28182    7994     3
##   1  4155     435     0
bank <- select(bank, -c('default'))

As there is almost no variation in this variable (all observations except 3 belong to the "no" and "unknown" groups), we remove it, since it does not provide any vital information.

Loan

CrossTable(bank$loan, bank$y, prop.c = F, prop.t = F, prop.chisq = F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  40769 
## 
##  
##              | bank$y 
##    bank$loan |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     29799 |      3806 |     33605 | 
##              |     0.887 |     0.113 |     0.824 | 
## -------------|-----------|-----------|-----------|
##      unknown |       877 |       107 |       984 | 
##              |     0.891 |     0.109 |     0.024 | 
## -------------|-----------|-----------|-----------|
##          yes |      5503 |       677 |      6180 | 
##              |     0.890 |     0.110 |     0.152 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36179 |      4590 |     40769 | 
## -------------|-----------|-----------|-----------|
## 
## 
chisq.test(bank$loan, bank$y)
## 
##  Pearson's Chi-squared test
## 
## data:  bank$loan and bank$y
## X-squared = 0.86841, df = 2, p-value = 0.6478
bank <- select(bank, -c('loan'))

The differences between the groups do not appear to be statistically significant (chi-squared test p-value of about 0.65). We remove this variable.

Housing

CrossTable(bank$housing, bank$y, prop.c = F, prop.t = F, prop.chisq = F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  40769 
## 
##  
##              | bank$y 
## bank$housing |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     16416 |      2003 |     18419 | 
##              |     0.891 |     0.109 |     0.452 | 
## -------------|-----------|-----------|-----------|
##      unknown |       877 |       107 |       984 | 
##              |     0.891 |     0.109 |     0.024 | 
## -------------|-----------|-----------|-----------|
##          yes |     18886 |      2480 |     21366 | 
##              |     0.884 |     0.116 |     0.524 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36179 |      4590 |     40769 | 
## -------------|-----------|-----------|-----------|
## 
## 
chisq.test(bank$housing, bank$y)
## 
##  Pearson's Chi-squared test
## 
## data:  bank$housing and bank$y
## X-squared = 5.4627, df = 2, p-value = 0.06513
bank <- select(bank, -c('housing'))

Here the differences among the groups are also not statistically significant at the 5% level (p-value of about 0.065). We remove 'housing' as well.

Social and economic context attributes

unique(bank$emp.var.rate)
##  [1] 1.1    01.kwi -0.1   -0.2   -1.8   -2.9   -3.4   -3     -1.7   -1.1  
## Levels: -0.1 -0.2 -1.1 -1.7 -1.8 -2.9 -3 -3.4 01.kwi 1.1
unique(bank$euribor3m)
##   [1] 4.857  4.856  4.855  4.859  kwi.86 4.858  4.864  4.865  4.866  4.967 
##  [11] 4.961  4.959  4.958  kwi.96 4.962  4.955  4.947  4.956  4.966  4.963 
##  [21] 4.957  4.968  kwi.97 4.965  4.964  5.045  5      4.936  4.921  4.918 
##  [31] 4.912  4.827  4.794  kwi.76 4.733  04.lip 4.663  4.592  4.474  4.406 
##  [41] 4.343  4.286  4.245  4.223  4.191  4.153  04.gru 4.076  4.021  3.901 
##  [51] 3.879  3.853  3.816  3.743  3.669  3.563  3.488  3.428  3.329  3.282 
##  [61] 3.053  1.811  1.799  1.778  1.757  1.726  1.703  1.687  1.663  sty.65
##  [71] sty.64 1.629  1.614  1.602  1.584  1.574  sty.56 1.556  1.548  1.538 
##  [81] 1.531  sty.52 sty.51 1.498  1.483  1.479  1.466  1.453  1.445  1.435 
##  [91] 1.423  1.415  sty.41 1.405  1.406  01.kwi 1.392  1.384  1.372  1.365 
## [101] 1.354  1.344  1.334  1.327  1.313  1.299  1.291  1.281  1.266  sty.25
## [111] 1.244  1.259  1.264  sty.27 1.262  sty.26 1.268  1.286  1.252  1.235 
## [121] 1.224  1.215  1.206  1.099  1.085  1.072  1.059  1.048  1.044  1.029 
## [131] 1.018  1.007  0.996  0.979  0.969  0.944  0.937  0.933  0.927  0.921 
## [141] 0.914  0.908  0.903  0.899  0.884  0.883  0.881  0.879  0.873  0.869 
## [151] 0.861  0.859  0.854  0.851  0.849  0.843  0.838  0.834  0.829  0.825 
## [161] 0.821  0.819  0.813  0.809  0.803  0.797  0.788  0.781  0.778  0.773 
## [171] 0.771  0.77   0.768  0.766  0.762  0.755  0.749  0.743  0.741  0.739 
## [181] 0.75   0.753  0.754  0.752  0.744  0.74   0.742  0.737  0.735  0.733 
## [191] 0.73   0.731  0.728  0.724  0.722  0.72   0.719  0.716  0.715  0.714 
## [201] 0.718  0.721  0.717  0.712  0.71   0.709  0.708  0.706  0.707  0.7   
## [211] 0.655  0.654  0.653  0.652  0.651  0.65   0.649  0.646  0.644  0.643 
## [221] 0.639  0.637  0.635  0.636  0.634  0.638  0.64   0.642  0.645  0.659 
## [231] 0.663  0.668  0.672  0.677  0.682  0.683  0.684  0.685  0.688  0.69  
## [241] 0.692  0.695  0.697  0.699  0.701  0.702  0.704  0.711  0.713  0.723 
## [251] 0.727  0.729  0.732  0.748  0.761  0.767  0.782  0.79   0.793  0.802 
## [261] 0.81   0.822  0.827  0.835  0.84   0.846  0.87   0.876  0.885  0.889 
## [271] 0.893  0.896  0.898  0.9    0.904  0.905  0.895  0.894  0.891  0.89  
## [281] 0.888  0.886  0.882  0.88   0.878  0.877  0.942  0.953  0.956  0.959 
## [291] 0.965  0.972  0.977  0.982  0.985  0.987  0.993  1      1.008  1.016 
## [301] 1.025  1.032  1.037  1.043  1.045  1.047  01.maj 1.049  1.046  1.041 
## [311] 1.039  1.035  01.mar 1.031  1.028 
## 315 Levels: 0.634 0.635 0.636 0.637 0.638 0.639 0.64 0.642 0.643 0.644 ... sty.65
# Decode the Polish month abbreviations that Excel substituted for some numbers:
# sty = January (1), mar = March (3), kwi = April (4), maj = May (5),
# lip = July (7), gru = December (12)
bank$emp.var.rate <- gsub("sty", '1', bank$emp.var.rate)
bank$emp.var.rate <- gsub("kwi", '4', bank$emp.var.rate)
bank$euribor3m <- gsub("sty", '1', bank$euribor3m)
bank$euribor3m <- gsub("mar", '3', bank$euribor3m)
bank$euribor3m <- gsub("kwi", '4', bank$euribor3m)
bank$euribor3m <- gsub("maj", '5', bank$euribor3m)
bank$euribor3m <- gsub("lip", '7', bank$euribor3m)
bank$euribor3m <- gsub("gru", '12', bank$euribor3m)

Since some of the numbers were interpreted as months in Excel, we have to decode them back into numbers. Next we convert all the variables from this group to numeric. Then we check the correlations among them.

# Convert to numeric; the columns still stored as factors must go through
# as.character() first, otherwise as.numeric() would return the factor level
# codes instead of the actual values
bank$emp.var.rate <- as.numeric(bank$emp.var.rate)   # already character after gsub()
bank$euribor3m <- as.numeric(bank$euribor3m)         # already character after gsub()
bank$cons.price.idx <- as.numeric(as.character(bank$cons.price.idx))
bank$cons.conf.idx <- as.numeric(as.character(bank$cons.conf.idx))
bank$nr.employed <- as.numeric(as.character(bank$nr.employed))
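# a quick sanity check (our addition): the coercions above should not have
# introduced missing values if every Excel month string was decoded correctly
stopifnot(sum(is.na(bank[, c("emp.var.rate", "cons.price.idx", "cons.conf.idx",
                             "euribor3m", "nr.employed")])) == 0)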

social <- bank %>%
  select(emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed)

library(corrplot)
## corrplot 0.84 loaded
cor_social <- cor(social)
corrplot(cor_social, method = 'number', type = 'upper')

Since euribor3m is strongly correlated (almost perfectly collinear) with emp.var.rate and nr.employed, we remove it from the further analysis.
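
As a quick additional check (our own, not part of the original analysis), the near-collinear pairs can be listed directly from the correlation matrix:

high_cor <- which(abs(cor_social) > 0.9 & upper.tri(cor_social), arr.ind = TRUE)
high_cor_pairs <- data.frame(var1 = rownames(cor_social)[high_cor[, "row"]],
                             var2 = colnames(cor_social)[high_cor[, "col"]],
                             cor  = cor_social[high_cor])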

bank <- select(bank, -c('euribor3m'))

Information Value

Although most of the associations of the explanatory variables with the outcome variable have already been checked, let's use the Information Value (IV) to see which variables we can expect to have the strongest impact on y.
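
Note that the code shown so far never creates the pdays_binary variable that appears in the IV table below, nor does it drop pdays or duration. A plausible reconstruction of that missing step (our assumption, consistent with the dataset documentation: pdays equal to 999 means the client was not previously contacted, and duration should be discarded if the goal is a realistic predictive model) is:

bank$pdays_binary <- as.factor(ifelse(bank$pdays == 999, 0, 1))
bank <- select(bank, -c('pdays', 'duration'))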

library("Information")
bank$y <- ifelse(bank$y == '1', 1, 0)
IV <- create_infotables(data = bank, y = "y", bins = 5, parallel = FALSE)
IV$Summary
##          Variable          IV
## 14    nr.employed 1.171818847
## 11   emp.var.rate 0.913439490
## 15   pdays_binary 0.555002409
## 10       poutcome 0.550876798
## 6           month 0.484687904
## 12 cons.price.idx 0.385989656
## 9        previous 0.342814339
## 5         contact 0.251907553
## 13  cons.conf.idx 0.230233359
## 1             age 0.193432291
## 2             job 0.190651464
## 4       education 0.047426068
## 8        campaign 0.035463010
## 3         marital 0.027303022
## 7     day_of_week 0.006175579

Most of the social and economic attributes have very high Information Values, which indicates strong predictive power. The same can be said about some variables concerning the previous campaign, namely poutcome, pdays_binary and previous. We decided to discard the day_of_week variable, since variables with IVs lower than 0.02 are generally regarded as having no predictive power at all.
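
For reference, the IV reported by the package is built from Weights of Evidence (WoE). A minimal manual sketch for a single categorical variable (here poutcome), which may differ from the Information package in binning and edge-case handling, could look like this:

tab   <- table(bank$poutcome, bank$y)
p_yes <- tab[, "1"] / sum(tab[, "1"])   # distribution of positives across bins
p_no  <- tab[, "0"] / sum(tab[, "0"])   # distribution of negatives across bins
woe   <- log(p_yes / p_no)              # Weight of Evidence of each bin
iv_poutcome <- sum((p_yes - p_no) * woe)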

bank <- select(bank, -c("day_of_week"))

Data division, modelling and results

We divide the data into training and test samples in (approximately) 80:20 proportions.

set.seed(103)
bank <- bank[sample(nrow(bank)), ] 
bank$y <- as.factor(bank$y)

train_proportion <- 0.8
train_index <- runif(nrow(bank)) < train_proportion
train <- bank[train_index,]
test <- bank[!train_index,]

test$y <- as.factor(ifelse(test$y == 1, "Yes", "No"))
train$y <- as.factor(ifelse(train$y == 1, "Yes", "No"))
library(rpart)
library(rpart.plot) 
library(ROCR)
library(caret) 
## Loading required package: lattice
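
The random split above yields only approximately 80:20 proportions and is not stratified. A stratified alternative (a sketch using caret::createDataPartition, which samples within each class of y so that both parts keep roughly the same share of positives; it is not used for the results reported below) would be:

set.seed(103)
idx_strat   <- createDataPartition(bank$y, p = 0.8, list = FALSE)
train_strat <- bank[idx_strat, ]
test_strat  <- bank[-idx_strat, ]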

Decision Tree

Now it's time to build our first model - the Decision Tree. We use the 'trainControl' function from the caret package to set up the resampling scheme and let the train function determine the best complexity parameter (cp) for the tree. 10-fold cross-validation is used to check how different values of the complexity parameter behave on different samples. The tree itself is fitted with rpart (through caret's 'rpart' method) using the value of cp that brings the best results, and finally we can plot the tree.

set.seed(123)

fitControl <- trainControl(method = 'cv',
                           number = 10,
                           classProbs = T,
                           summaryFunction = twoClassSummary)

tree <- train(y ~ .,
              data = train,
              trControl = fitControl,
              metric = "ROC",
              method = "rpart")
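
# (sketch, our addition) caret stores the cp chosen by cross-validation in
# tree$bestTune and the fitted rpart object in tree$finalModel, which can be
# drawn with rpart.plot to obtain the tree plot mentioned above
best_cp <- tree$bestTune$cp
rpart.plot(tree$finalModel)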

tree_fitted <- predict(tree, train, type = "prob")
tree_fitted_test <- predict(tree, test, type = "prob")

pred <- factor(ifelse(tree_fitted[, "Yes"] > 0.2, "Yes", "No"))
confusionMatrix(pred, train$y, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  26690  1964
##        Yes  2163  1761
##                                           
##                Accuracy : 0.8733          
##                  95% CI : (0.8697, 0.8769)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1.000000        
##                                           
##                   Kappa : 0.3887          
##                                           
##  Mcnemar's Test P-Value : 0.002055        
##                                           
##             Sensitivity : 0.47275         
##             Specificity : 0.92503         
##          Pos Pred Value : 0.44878         
##          Neg Pred Value : 0.93146         
##              Prevalence : 0.11434         
##          Detection Rate : 0.05405         
##    Detection Prevalence : 0.12045         
##       Balanced Accuracy : 0.69889         
##                                           
##        'Positive' Class : Yes             
## 
pred_test <- factor(ifelse(tree_fitted_test[, "Yes"] > 0.2, "Yes", "No"))
confusionMatrix(pred_test, test$y, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  6773  447
##        Yes  553  418
##                                           
##                Accuracy : 0.8779          
##                  95% CI : (0.8706, 0.8849)
##     No Information Rate : 0.8944          
##     P-Value [Acc > NIR] : 0.9999991       
##                                           
##                   Kappa : 0.3868          
##                                           
##  Mcnemar's Test P-Value : 0.0008989       
##                                           
##             Sensitivity : 0.48324         
##             Specificity : 0.92452         
##          Pos Pred Value : 0.43048         
##          Neg Pred Value : 0.93809         
##              Prevalence : 0.10560         
##          Detection Rate : 0.05103         
##    Detection Prevalence : 0.11854         
##       Balanced Accuracy : 0.70388         
##                                           
##        'Positive' Class : Yes             
## 

Evaluation of the Decision Tree results

The 0.20 probability cut-off point was used to classify observations, as the best trade-off between specificity and sensitivity.
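
Such a cut-off can be screened with ROCR (already loaded above); the sketch below, with variable names of our own choosing, picks the threshold that maximises balanced accuracy on the training sample:

pred_obj  <- prediction(tree_fitted[, "Yes"], train$y)
perf_sens <- performance(pred_obj, "sens")
perf_spec <- performance(pred_obj, "spec")
cutoffs   <- perf_sens@x.values[[1]]
balanced  <- (perf_sens@y.values[[1]] + perf_spec@y.values[[1]]) / 2
best_cutoff <- cutoffs[which.max(balanced)]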

In the training sample, we obtained the following results:
• Sensitivity = 47.3%
• Specificity = 92.5%
• Balanced accuracy = 0.699

In the test sample:
• Sensitivity = 48.3%
• Specificity = 92.5%
• Balanced accuracy = 0.704

Surprisingly, the model did better on the test sample than on the training sample.

Logistic Regression

Our next model, logistic regression, will again be built with the train function. This time we use 5-fold cross-validation (repeated 3 times) to obtain a comprehensive assessment of the predictions on different parts of the dataset.

set.seed(123)

ctrl_cv5x3a <- trainControl(method = "repeatedcv",
                            number = 5,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            repeats = 3)

logit_train <- train(y ~.,
                     data = train,
                     method = "glm",
                     family = "binomial",
                     trControl = ctrl_cv5x3a)

logit_train_fitted <- predict(logit_train, train, type = "prob")
logit_test_fitted <- predict(logit_train, test, type = "prob")
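
# (sketch, our addition) the underlying glm fit is available in
# logit_train$finalModel; its exponentiated coefficients can be read as odds
# ratios of subscribing to the deposit
odds_ratios <- exp(coef(logit_train$finalModel))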

pred <- factor(ifelse(logit_train_fitted[, "Yes"] > 0.19, "Yes", "No"))
confusionMatrix(pred, train$y, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  25806  1581
##        Yes  3047  2144
##                                              
##                Accuracy : 0.8579             
##                  95% CI : (0.8541, 0.8617)   
##     No Information Rate : 0.8857             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.4012             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.57557            
##             Specificity : 0.89440            
##          Pos Pred Value : 0.41302            
##          Neg Pred Value : 0.94227            
##              Prevalence : 0.11434            
##          Detection Rate : 0.06581            
##    Detection Prevalence : 0.15934            
##       Balanced Accuracy : 0.73498            
##                                              
##        'Positive' Class : Yes                
## 
pred_test <- factor(ifelse(logit_test_fitted[, "Yes"] > 0.19, "Yes", "No"))
confusionMatrix(pred_test, test$y, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  6546  375
##        Yes  780  490
##                                              
##                Accuracy : 0.859              
##                  95% CI : (0.8513, 0.8665)   
##     No Information Rate : 0.8944             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.3813             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.56647            
##             Specificity : 0.89353            
##          Pos Pred Value : 0.38583            
##          Neg Pred Value : 0.94582            
##              Prevalence : 0.10560            
##          Detection Rate : 0.05982            
##    Detection Prevalence : 0.15505            
##       Balanced Accuracy : 0.73000            
##                                              
##        'Positive' Class : Yes                
## 

Evaluation of the logistic regression results

The 0.19 probability cut-off point was used to classify observations, as the best trade-off between specificity and sensitivity. In the training sample, sensitivity equals around 57.6% and specificity 89.4%. Balanced accuracy is therefore equal to 0.735.

In the test sample the results are slightly worse, but still reasonably good. Sensitivity dropped by around 1 percentage point to 56.6% and specificity stayed practically the same (89.4%). Balanced accuracy was 0.73.

K-nearest neighbours

The final classification model is k-nearest neighbours (KNN). This method is the most computationally expensive of the ones used in this project. This time we have to define additional parameters in the train function: 'trControl', to conduct 5-fold cross-validation, and 'preProcess', to rescale the variables to a common range. Applying these parameters helps to achieve more accurate predictions.

Beforehand, using the 'tuneGrid' parameter, we determined the optimal value of the hyperparameter, k = 86. As this search is very computationally expensive, we do not repeat it in this paper; a sketch of what it could look like is shown below.
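
For reference, the grid search mentioned above could be specified along the following lines (a sketch; the grid of candidate k values is our own illustrative choice, and the actual call is left commented out because of its cost):

ctrl_cv5_grid <- trainControl(method = "cv", number = 5)
k_grid <- data.frame(k = seq(6, 106, by = 10))
# train(y ~ ., data = train, method = "knn", trControl = ctrl_cv5_grid,
#       tuneGrid = k_grid, preProcess = c("range"))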

set.seed(123)

ctrl_cv5 <- trainControl(method = "cv",
                         number = 5)
different_k <- data.frame(k = 86)

train_knn_tuned <- 
  train(y ~ .,
        data = train,
        method = "knn",
        trControl = ctrl_cv5,
        tuneGrid = different_k,
        preProcess = c("range")
  )

train_knn_tuned_forecast <- predict(train_knn_tuned, train, type = "prob")
test_knn_tuned_forecast <- predict(train_knn_tuned, test, type = "prob")

pred_knn <- factor(ifelse(train_knn_tuned_forecast[, "Yes"] > 0.15, "Yes", "No"))
confusionMatrix(pred_knn, train$y, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  26159  1691
##        Yes  2694  2034
##                                              
##                Accuracy : 0.8654             
##                  95% CI : (0.8616, 0.8691)   
##     No Information Rate : 0.8857             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.4052             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.54604            
##             Specificity : 0.90663            
##          Pos Pred Value : 0.43020            
##          Neg Pred Value : 0.93928            
##              Prevalence : 0.11434            
##          Detection Rate : 0.06243            
##    Detection Prevalence : 0.14513            
##       Balanced Accuracy : 0.72634            
##                                              
##        'Positive' Class : Yes                
## 
pred_knn_test <- factor(ifelse(test_knn_tuned_forecast[, "Yes"] > 0.15, "Yes", "No"))
confusionMatrix(pred_knn_test, test$y, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  6631  414
##        Yes  695  451
##                                              
##                Accuracy : 0.8646             
##                  95% CI : (0.857, 0.8719)    
##     No Information Rate : 0.8944             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.3731             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.52139            
##             Specificity : 0.90513            
##          Pos Pred Value : 0.39354            
##          Neg Pred Value : 0.94123            
##              Prevalence : 0.10560            
##          Detection Rate : 0.05506            
##    Detection Prevalence : 0.13991            
##       Balanced Accuracy : 0.71326            
##                                              
##        'Positive' Class : Yes                
## 

Evaluation of the KNN results

The 0.15 probability cut-off point was used to classify observations.

The results of KNN in the training sample:
• Sensitivity = 54.6%
• Specificity = 90.7%
• Balanced accuracy = 72.6%

Results in the test sample:
• Sensitivity = 52.1%
• Specificity = 90.5%
• Balanced accuracy = 71.3%

Comparison of the models and conclusions

Out of the 3 models presented in this report, logistic regression did the best job of identifying true positives and has the best balanced accuracy. The results of the KNN model are also satisfactory, although it is not as good as logistic regression at identifying true positives (its sensitivity on the test sample was more than 4 percentage points lower than that of logistic regression). Lastly, the decision tree, while having the highest specificity (i.e. the highest rate of correctly identified negatives), did much worse at predicting the positive class (around 8 percentage points lower sensitivity than logistic regression).

All things considered, if we were to choose a model for this Portuguese bank to predict whether customers will subscribe to a term deposit, we would opt for the one obtained with the logistic regression method. There are other algorithms that could possibly bring even better results than those presented in this paper. For example, we tried to use the Support Vector Machine algorithm, which turned out to be too computationally expensive for our hardware but could be used by someone with a better CPU and more RAM. Random Forest, Gradient Boosting or the aforementioned SVM are methods worth considering by other researchers who will try to achieve the best possible predictions in the future.