Banking Deposit Investment Classification Using Logistic Regression and K-NN

Introduction

There has been a revenue decline in the Portuguese Bank and they would like to know what actions to take. After investigation, they found that the root cause was that their customers are not investing enough for long term deposits. So the bank would like to identify existing customers that have a higher chance to subscribe for a long term deposit and focus marketing efforts on such customers.

In this report, we are going to answer the bank’s problem using Logistic Regression and K-Nearest Neighbor. The process includes Data Preparation, Exploratory Data Analysis, Data Pre-Processing, Model Building & Prediction, Model Evaluation and Conclusion.

Data Preparation

# Library Input
library(dplyr)
library(MLmetrics)
library(performance)
library(ggplot2)
library(lmtest)
library(car)
library(caret)
library(partykit)
library(class)

# Data Input
bank <- read.csv("data/bank_investments.csv", stringsAsFactors = T)

head(bank)

##   age          job  marital         education default housing loan   contact
## 1  49  blue-collar  married          basic.9y unknown      no   no  cellular
## 2  37 entrepreneur  married university.degree      no      no   no telephone
## 3  78      retired  married          basic.4y      no      no   no  cellular
## 4  36       admin.  married university.degree      no     yes   no telephone
## 5  59      retired divorced university.degree      no      no   no  cellular
## 6  29       admin.   single university.degree      no      no   no  cellular
##   month day_of_week duration campaign pdays previous    poutcome   y
## 1   nov         wed      227        4   999        0 nonexistent  no
## 2   nov         wed      202        2   999        1     failure  no
## 3   jul         mon     1148        1   999        0 nonexistent yes
## 4   may         mon      120        2   999        0 nonexistent  no
## 5   jun         tue      368        2   999        0 nonexistent  no
## 6   aug         wed      256        2   999        0 nonexistent  no

The data used in this report contains 32950 data and 21 inputs including the target feature, ordered by date (from May 2008 to November 2010). The data set consists of several variables with the following details:

age : the age of the client
job : type of job
marital : marital status
education: client last education
default : whether or not the client has credit in default
balance : the balance in client account
housing : does the client has housing loan?
loan : does the client has personal loan?
contact : contact communication type
month : last contact month of year
day_of_week: last contact day of the week
duration : last contact duration, in seconds
campaign : number of contacts performed during this campaign and for this client (includes last contact)
pdays : number of days that passed by after the client was last contacted from a previous campaign (999 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client
poutcome : outcome of the previous marketing campaign

Target Variable:

y : has the client subscribed a term deposit?

# Checking Missing Values
colSums(is.na(bank))

##         age         job     marital   education     default     housing 
##           0           0           0           0           0           0 
##        loan     contact       month day_of_week    duration    campaign 
##           0           0           0           0           0           0 
##       pdays    previous    poutcome           y 
##           0           0           0           0

All data types have been converted to the desired data types and there’s no more missing value.

Exploratory Data Analysis

Exploratory data analysis is a phase where we explore the data variables, and find out any pattern that can indicate any kind of correlation between the variables.

Based on the chart above, it can be seen that the customer with the job type admin are most likely to accept the investment offers.

Based on the chart above, it can be seen that the customer with married status are most likely to accept the investment offers.

Based on the chart above, it can be seen that the customer with univerity.degree are most likely to accept the investment offers.

Based on the chart above, it can be seen that the customer with no default credit status are most likely to accept the investment offers.

Based on the chart above, it can be seen that the customer with housing_loan are most likely to accept the investment offers.

Based on the chart above, it can be seen that the customer with no personal loan are most likely to accept the investment offers.

Based on the chart above, it can be seen that the customer with cullular communication type are most likely to accept the investment offers.

Based on the chart above, it can be seen that the customer contacted on may are most likely to accept the investment offers.

Based on the chart above, it can be seen that the customer contacted on Thursday are most likely to accept the investment offers.

Based on the chart above, it can be seen that the customer that has nonexistent previous campaign attempt are most likely to accept the investment offers.

Data Pre-Processing

In this process, we will split the data into train data set and test data set. The train data set will be used to build models, while the test data set will be used to predict the target variable. We will took 80% of the data as the train data set and the rest will be used as the test data set.

# Splitting the data set
RNGkind(sample.kind = "Rounding")
set.seed(123) 

index_bank <- sample(x = nrow(bank) , size = nrow(bank)*0.8) 
bank_train <- bank[index_bank, ]
bank_test <- bank[-index_bank, ]

Next, we will check the data proportion, whether or not the data is imbalance.

# Checking data proportion
prop.table(table(bank$y))

## 
##        no       yes 
## 0.8873445 0.1126555

# Checking data proportion of train data set
prop.table(table(bank_train$y))

## 
##        no       yes 
## 0.8876328 0.1123672

# Checking data proportion of test data set
prop.table(table(bank_test$y))

## 
##        no       yes 
## 0.8861912 0.1138088

Since the class distribution in the target variable is around 89 : 11 indicating an imbalance dataset, we need to resample it.

# Downsampling
RNGkind(sample.kind = "Rounding")
set.seed(123)

bank_train_down <- downSample(x = bank_train %>% select(-y),
                              y = bank_train$y,
                              yname = "y")

prop.table(table(bank_train_down$y))

## 
##  no yes 
## 0.5 0.5

A balanced proportion of classes is important for the data train because we will be training the model using the data train.

Model Building & Prediction

Logistic Regression

# Model Building (all variables)
model_lr <- glm(formula = y ~ . , data = bank_train_down, family = "binomial")
summary(model_lr)

## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = bank_train_down)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4967  -0.5593  -0.0008   0.5494   2.6987  
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   7.995e-02  6.539e-01   0.122 0.902698    
## age                           4.647e-03  4.454e-03   1.043 0.296771    
## jobblue-collar               -4.839e-01  1.452e-01  -3.333 0.000861 ***
## jobentrepreneur              -5.441e-01  2.212e-01  -2.460 0.013894 *  
## jobhousemaid                 -1.835e-01  2.664e-01  -0.689 0.491014    
## jobmanagement                -2.137e-01  1.541e-01  -1.386 0.165640    
## jobretired                    9.388e-01  2.048e-01   4.584 4.57e-06 ***
## jobself-employed              3.081e-02  2.243e-01   0.137 0.890730    
## jobservices                  -2.338e-01  1.550e-01  -1.508 0.131485    
## jobstudent                    7.246e-01  2.329e-01   3.112 0.001859 ** 
## jobtechnician                -1.365e-01  1.314e-01  -1.039 0.298835    
## jobunemployed                 3.228e-02  2.453e-01   0.132 0.895289    
## jobunknown                    5.796e-01  4.581e-01   1.265 0.205771    
## maritalmarried                1.984e-01  1.268e-01   1.565 0.117637    
## maritalsingle                 4.292e-01  1.440e-01   2.980 0.002879 ** 
## maritalunknown                1.590e+00  9.960e-01   1.597 0.110320    
## educationbasic.6y            -1.288e-01  2.238e-01  -0.576 0.564849    
## educationbasic.9y            -3.358e-01  1.783e-01  -1.883 0.059701 .  
## educationhigh.school         -7.043e-02  1.717e-01  -0.410 0.681597    
## educationilliterate           1.404e+01  2.430e+02   0.058 0.953902    
## educationprofessional.course -5.801e-02  1.935e-01  -0.300 0.764358    
## educationuniversity.degree    1.543e-01  1.745e-01   0.884 0.376638    
## educationunknown              6.021e-02  2.263e-01   0.266 0.790171    
## defaultunknown               -7.506e-01  1.213e-01  -6.187 6.14e-10 ***
## defaultyes                   -1.216e+01  5.354e+02  -0.023 0.981884    
## housingunknown                7.847e-02  2.568e-01   0.306 0.759938    
## housingyes                   -7.170e-02  7.661e-02  -0.936 0.349311    
## loanunknown                          NA         NA      NA       NA    
## loanyes                       6.327e-03  1.059e-01   0.060 0.952371    
## contacttelephone             -1.275e+00  1.031e-01 -12.359  < 2e-16 ***
## monthaug                     -1.205e+00  1.559e-01  -7.729 1.09e-14 ***
## monthdec                      1.744e+00  5.951e-01   2.930 0.003392 ** 
## monthjul                     -1.358e+00  1.589e-01  -8.544  < 2e-16 ***
## monthjun                     -2.268e-01  1.678e-01  -1.352 0.176529    
## monthmar                      1.231e+00  2.515e-01   4.895 9.84e-07 ***
## monthmay                     -1.284e+00  1.498e-01  -8.573  < 2e-16 ***
## monthnov                     -1.125e+00  1.708e-01  -6.588 4.47e-11 ***
## monthoct                      1.033e+00  2.416e-01   4.278 1.89e-05 ***
## monthsep                      2.781e-01  2.556e-01   1.088 0.276481    
## day_of_weekmon               -1.459e-01  1.183e-01  -1.233 0.217546    
## day_of_weekthu               -8.816e-02  1.199e-01  -0.735 0.462071    
## day_of_weektue                4.548e-02  1.204e-01   0.378 0.705524    
## day_of_weekwed                8.585e-02  1.205e-01   0.712 0.476261    
## duration                      6.169e-03  1.870e-04  32.985  < 2e-16 ***
## campaign                     -7.285e-02  2.091e-02  -3.485 0.000493 ***
## pdays                        -1.481e-03  4.904e-04  -3.021 0.002520 ** 
## previous                      2.273e-01  1.477e-01   1.539 0.123718    
## poutcomenonexistent           9.715e-02  2.039e-01   0.476 0.633728    
## poutcomesuccess               8.936e-01  4.812e-01   1.857 0.063332 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8212.4  on 5923  degrees of freedom
## Residual deviance: 4526.2  on 5876  degrees of freedom
## AIC: 4622.2
## 
## Number of Fisher Scoring iterations: 12

# Feature Selection (backward)
model_lr2 <- step(object = model_lr, direction="backward", trace = 0)

summary(model_lr2)

## 
## Call:
## glm(formula = y ~ job + marital + education + default + contact + 
##     month + duration + campaign + pdays + previous, family = "binomial", 
##     data = bank_train_down)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4719  -0.5608  -0.0008   0.5548   2.6753  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   1.172e+00  3.131e-01   3.742 0.000182 ***
## jobblue-collar               -4.828e-01  1.445e-01  -3.341 0.000834 ***
## jobentrepreneur              -5.397e-01  2.208e-01  -2.444 0.014516 *  
## jobhousemaid                 -1.488e-01  2.640e-01  -0.563 0.573185    
## jobmanagement                -2.023e-01  1.533e-01  -1.320 0.186935    
## jobretired                    1.047e+00  1.825e-01   5.734 9.82e-09 ***
## jobself-employed              1.855e-02  2.238e-01   0.083 0.933921    
## jobservices                  -2.363e-01  1.546e-01  -1.529 0.126312    
## jobstudent                    6.753e-01  2.283e-01   2.958 0.003099 ** 
## jobtechnician                -1.381e-01  1.312e-01  -1.053 0.292194    
## jobunemployed                 3.684e-02  2.448e-01   0.150 0.880403    
## jobunknown                    6.017e-01  4.580e-01   1.314 0.188977    
## maritalmarried                1.846e-01  1.258e-01   1.467 0.142352    
## maritalsingle                 3.834e-01  1.358e-01   2.824 0.004747 ** 
## maritalunknown                1.523e+00  9.954e-01   1.530 0.125921    
## educationbasic.6y            -1.470e-01  2.222e-01  -0.662 0.508143    
## educationbasic.9y            -3.685e-01  1.758e-01  -2.096 0.036067 *  
## educationhigh.school         -1.031e-01  1.688e-01  -0.611 0.541411    
## educationilliterate           1.402e+01  2.422e+02   0.058 0.953843    
## educationprofessional.course -8.191e-02  1.914e-01  -0.428 0.668733    
## educationuniversity.degree    1.220e-01  1.712e-01   0.713 0.475913    
## educationunknown              5.832e-02  2.253e-01   0.259 0.795787    
## defaultunknown               -7.409e-01  1.203e-01  -6.157 7.40e-10 ***
## defaultyes                   -1.211e+01  5.354e+02  -0.023 0.981950    
## contacttelephone             -1.257e+00  1.017e-01 -12.364  < 2e-16 ***
## monthaug                     -1.183e+00  1.541e-01  -7.676 1.64e-14 ***
## monthdec                      1.781e+00  5.918e-01   3.009 0.002620 ** 
## monthjul                     -1.340e+00  1.572e-01  -8.524  < 2e-16 ***
## monthjun                     -2.212e-01  1.662e-01  -1.331 0.183168    
## monthmar                      1.235e+00  2.508e-01   4.926 8.41e-07 ***
## monthmay                     -1.271e+00  1.484e-01  -8.565  < 2e-16 ***
## monthnov                     -1.111e+00  1.699e-01  -6.537 6.27e-11 ***
## monthoct                      1.044e+00  2.404e-01   4.343 1.41e-05 ***
## monthsep                      3.105e-01  2.543e-01   1.221 0.222084    
## duration                      6.170e-03  1.867e-04  33.040  < 2e-16 ***
## campaign                     -7.388e-02  2.092e-02  -3.532 0.000413 ***
## pdays                        -2.325e-03  1.983e-04 -11.727  < 2e-16 ***
## previous                      1.541e-01  8.165e-02   1.887 0.059123 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8212.4  on 5923  degrees of freedom
## Residual deviance: 4537.0  on 5886  degrees of freedom
## AIC: 4613
## 
## Number of Fisher Scoring iterations: 12

The summary of the current logistic regression model can be seen above. Coefficients that have p-value below 0.05 indicate that they are significant and should not be removed from the equation. And based on the result, it can be seen that the most significant variables in predicting customer’s investment decision is duration, contact and month.

# Predict Data Test
pred_lr <- predict(object = model_lr2, newdata = bank_test, type = "response")
pred_lr_label <- as.factor(ifelse(pred_lr > 0.5, "yes", "no"))
table(predict = pred_lr_label, 
      actual = bank_test$y)

##        actual
## predict   no  yes
##     no  4968  137
##     yes  872  613

K-Nearest Neighbor

Before building the model of K-Nearest Neighbor, we need to separate the target variable and the predictors.

# Data train predictors
bank_train_x <- bank_train_down %>% 
            select_if(is.numeric) 

# Data train target
bank_train_y <- bank_train_down %>% 
           select(y) 

# Data test predictors
bank_test_x <- bank_test %>% 
  select_if(is.numeric)

# Data test target
bank_test_y <- bank_test %>% 
  select(y)

Next, we need to make sure that the dataset has the same scale. This is because KNN classify based on distance, hence the range of each variable must be the same. If there is a variable with a very large range that differs greatly compared to the other variables, then that variable will greatly affect the classification results and ignore other variables. Therefore it is necessary to do scaling.

# Data Summary
summary(bank)

##       age                 job           marital                    education   
##  Min.   :17.00   admin.     :8314   divorced: 3675   university.degree  :9736  
##  1st Qu.:32.00   blue-collar:7441   married :19953   high.school        :7596  
##  Median :38.00   technician :5400   single  : 9257   basic.9y           :4826  
##  Mean   :40.01   services   :3196   unknown :   65   professional.course:4192  
##  3rd Qu.:47.00   management :2345                    basic.4y           :3322  
##  Max.   :98.00   retired    :1366                    basic.6y           :1865  
##                  (Other)    :4888                    (Other)            :1413  
##     default         housing           loan            contact     
##  no     :26007   no     :14900   no     :27131   cellular :20908  
##  unknown: 6940   unknown:  796   unknown:  796   telephone:12042  
##  yes    :    3   yes    :17254   yes    : 5023                    
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##      month       day_of_week    duration         campaign          pdays      
##  may    :11011   fri:6322    Min.   :   0.0   Min.   : 1.000   Min.   :  0.0  
##  jul    : 5763   mon:6812    1st Qu.: 103.0   1st Qu.: 1.000   1st Qu.:999.0  
##  aug    : 4948   thu:6857    Median : 180.0   Median : 2.000   Median :999.0  
##  jun    : 4247   tue:6444    Mean   : 258.1   Mean   : 2.561   Mean   :962.1  
##  nov    : 3266   wed:6515    3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0  
##  apr    : 2085               Max.   :4918.0   Max.   :56.000   Max.   :999.0  
##  (Other): 1630                                                                
##     previous             poutcome       y        
##  Min.   :0.0000   failure    : 3429   no :29238  
##  1st Qu.:0.0000   nonexistent:28416   yes: 3712  
##  Median :0.0000   success    : 1105              
##  Mean   :0.1747                                  
##  3rd Qu.:0.0000                                  
##  Max.   :7.0000                                  
##

# Scaling
bank_train_x_scaled <- scale(bank_train_x)

bank_test_x_scaled <- scale(bank_test_x,
                center = attr(bank_train_x_scaled,"scaled:center"),
                scale = attr(bank_train_x_scaled, "scaled:scale"))

Scaling process has been done. Next, the value of K will be determined. Selection of the value of K will be obtained from the root of the total observation (data train).

round(sqrt(nrow(bank_train_down)), 0)

## [1] 77

K = 77

# Model Building
pred_knn <-  knn(train = bank_train_x_scaled,
                  test = bank_test_x_scaled, 
                  cl = bank_train_y$y,
                  k = 77)

Model Evaluation

Logistic Regression

confusionMatrix(data = pred_lr_label, reference = bank_test$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  4968  137
##        yes  872  613
##                                          
##                Accuracy : 0.8469         
##                  95% CI : (0.838, 0.8555)
##     No Information Rate : 0.8862         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.4681         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.81733        
##             Specificity : 0.85068        
##          Pos Pred Value : 0.41279        
##          Neg Pred Value : 0.97316        
##              Prevalence : 0.11381        
##          Detection Rate : 0.09302        
##    Detection Prevalence : 0.22534        
##       Balanced Accuracy : 0.83401        
##                                          
##        'Positive' Class : yes            
##

Based on the Confusion Matrix result, we can conclude that this model is quite good, with the Accuracy around 85%. However, Accuracy will be the most appropriate metric to choose when the data we have is balanced. And we know that the data set we used is NOT, hence other factor should be considered. In our case, because we are offering the customer the investment, and in our model we would like a prediction where positive = customer decide to invest (yes), so we tend to rely on Recall / Sensitivity (we would want to approach / offer as many customer as we can) which is around 82%.

K-Nearest Neighbor

confusionMatrix(data = pred_knn, reference = bank_test$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  4939  210
##        yes  901  540
##                                           
##                Accuracy : 0.8314          
##                  95% CI : (0.8222, 0.8404)
##     No Information Rate : 0.8862          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4037          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.72000         
##             Specificity : 0.84572         
##          Pos Pred Value : 0.37474         
##          Neg Pred Value : 0.95922         
##              Prevalence : 0.11381         
##          Detection Rate : 0.08194         
##    Detection Prevalence : 0.21866         
##       Balanced Accuracy : 0.78286         
##                                           
##        'Positive' Class : yes             
##

Based on the Confusion Matrix result, we can conclude that this model is quite good, with the Accuracy around 83%. However, Accuracy will be the most appropriate metric to choose when the data we have is balanced. And we know that the data set we used is NOT, hence other factor should be considered. In our case, because we are offering the customer the investment, and in our model we would like a prediction where positive = customer decide to invest (yes), so we tend to rely on Recall / Sensitivity (we would want to approach / offer as many customer as we can) which is around 72%.

Conclusion

After predicting with two models, the Logistic Regression model and K-Nearest Neighbor model, there’s no significant difference in term of Accuracy. The Logistic Regression model has 85% of Accuracy`, while the K-Nearest Neighbor model has 83%.

However, Accuracy will be the most appropriate metric to choose when the data we have is balanced. And we know that the data set we used is NOT, hence other factor should be considered. In this case we would want to approach / offer as many customer as we can. So, we tend to rely on Sensitivity. In this case, the Logistic Regression model is better, it has 82% of Sensitivity compared to K-Nearest Neighbor model that has 72%.

Another interesting insight from this report, based on the Logistic Regression model, it can be seen that the most significant variables in predicting customer’s investment decision is duration, contact and month.