Banking Deposit Investment Classification Using Logistic Regression and K-NN
Introduction
There has been a revenue decline in the Portuguese Bank and they would like to know what actions to take. After investigation, they found that the root cause was that their customers are not investing enough for long term deposits. So the bank would like to identify existing customers that have a higher chance to subscribe for a long term deposit and focus marketing efforts on such customers.
In this report, we are going to answer the bank’s problem using Logistic Regression and K-Nearest Neighbor. The process includes Data Preparation, Exploratory Data Analysis, Data Pre-Processing, Model Building & Prediction, Model Evaluation and Conclusion.
Data Preparation
# Library Input
library(dplyr)
library(MLmetrics)
library(performance)
library(ggplot2)
library(lmtest)
library(car)
library(caret)
library(partykit)
library(class)# Data Input
bank <- read.csv("data/bank_investments.csv", stringsAsFactors = T)
head(bank)## age job marital education default housing loan contact
## 1 49 blue-collar married basic.9y unknown no no cellular
## 2 37 entrepreneur married university.degree no no no telephone
## 3 78 retired married basic.4y no no no cellular
## 4 36 admin. married university.degree no yes no telephone
## 5 59 retired divorced university.degree no no no cellular
## 6 29 admin. single university.degree no no no cellular
## month day_of_week duration campaign pdays previous poutcome y
## 1 nov wed 227 4 999 0 nonexistent no
## 2 nov wed 202 2 999 1 failure no
## 3 jul mon 1148 1 999 0 nonexistent yes
## 4 may mon 120 2 999 0 nonexistent no
## 5 jun tue 368 2 999 0 nonexistent no
## 6 aug wed 256 2 999 0 nonexistent no
The data used in this report contains 32950 data and 21 inputs including the target feature, ordered by date (from May 2008 to November 2010). The data set consists of several variables with the following details:
age: the age of the clientjob: type of jobmarital: marital statuseducation: client last educationdefault: whether or not the client has credit in defaultbalance: the balance in client accounthousing: does the client has housing loan?loan: does the client has personal loan?contact: contact communication typemonth: last contact month of yearday_of_week: last contact day of the weekduration: last contact duration, in secondscampaign: number of contacts performed during this campaign and for this client (includes last contact)pdays: number of days that passed by after the client was last contacted from a previous campaign (999 means client was not previously contacted)previous: number of contacts performed before this campaign and for this clientpoutcome: outcome of the previous marketing campaign
Target Variable:
y: has the client subscribed a term deposit?
# Checking Missing Values
colSums(is.na(bank))## age job marital education default housing
## 0 0 0 0 0 0
## loan contact month day_of_week duration campaign
## 0 0 0 0 0 0
## pdays previous poutcome y
## 0 0 0 0
All data types have been converted to the desired data types and there’s no more missing value.
Exploratory Data Analysis
Exploratory data analysis is a phase where we explore the data variables, and find out any pattern that can indicate any kind of correlation between the variables.
Based on the chart above, it can be seen that the customer with the job type admin are most likely to accept the investment offers.
Based on the chart above, it can be seen that the customer with married status are most likely to accept the investment offers.
Based on the chart above, it can be seen that the customer with univerity.degree are most likely to accept the investment offers.
Based on the chart above, it can be seen that the customer with no default credit status are most likely to accept the investment offers.
Based on the chart above, it can be seen that the customer with housing_loan are most likely to accept the investment offers.
Based on the chart above, it can be seen that the customer with no personal loan are most likely to accept the investment offers.
Based on the chart above, it can be seen that the customer with cullular communication type are most likely to accept the investment offers.
Based on the chart above, it can be seen that the customer contacted on may are most likely to accept the investment offers.
Based on the chart above, it can be seen that the customer contacted on Thursday are most likely to accept the investment offers.
Based on the chart above, it can be seen that the customer that has nonexistent previous campaign attempt are most likely to accept the investment offers.
Data Pre-Processing
In this process, we will split the data into train data set and test data set. The train data set will be used to build models, while the test data set will be used to predict the target variable. We will took 80% of the data as the train data set and the rest will be used as the test data set.
# Splitting the data set
RNGkind(sample.kind = "Rounding")
set.seed(123)
index_bank <- sample(x = nrow(bank) , size = nrow(bank)*0.8)
bank_train <- bank[index_bank, ]
bank_test <- bank[-index_bank, ]Next, we will check the data proportion, whether or not the data is imbalance.
# Checking data proportion
prop.table(table(bank$y))##
## no yes
## 0.8873445 0.1126555
# Checking data proportion of train data set
prop.table(table(bank_train$y))##
## no yes
## 0.8876328 0.1123672
# Checking data proportion of test data set
prop.table(table(bank_test$y))##
## no yes
## 0.8861912 0.1138088
Since the class distribution in the target variable is around 89 : 11 indicating an imbalance dataset, we need to resample it.
# Downsampling
RNGkind(sample.kind = "Rounding")
set.seed(123)
bank_train_down <- downSample(x = bank_train %>% select(-y),
y = bank_train$y,
yname = "y")prop.table(table(bank_train_down$y))##
## no yes
## 0.5 0.5
A balanced proportion of classes is important for the data train because we will be training the model using the data train.
Model Building & Prediction
Logistic Regression
# Model Building (all variables)
model_lr <- glm(formula = y ~ . , data = bank_train_down, family = "binomial")
summary(model_lr)##
## Call:
## glm(formula = y ~ ., family = "binomial", data = bank_train_down)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4967 -0.5593 -0.0008 0.5494 2.6987
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.995e-02 6.539e-01 0.122 0.902698
## age 4.647e-03 4.454e-03 1.043 0.296771
## jobblue-collar -4.839e-01 1.452e-01 -3.333 0.000861 ***
## jobentrepreneur -5.441e-01 2.212e-01 -2.460 0.013894 *
## jobhousemaid -1.835e-01 2.664e-01 -0.689 0.491014
## jobmanagement -2.137e-01 1.541e-01 -1.386 0.165640
## jobretired 9.388e-01 2.048e-01 4.584 4.57e-06 ***
## jobself-employed 3.081e-02 2.243e-01 0.137 0.890730
## jobservices -2.338e-01 1.550e-01 -1.508 0.131485
## jobstudent 7.246e-01 2.329e-01 3.112 0.001859 **
## jobtechnician -1.365e-01 1.314e-01 -1.039 0.298835
## jobunemployed 3.228e-02 2.453e-01 0.132 0.895289
## jobunknown 5.796e-01 4.581e-01 1.265 0.205771
## maritalmarried 1.984e-01 1.268e-01 1.565 0.117637
## maritalsingle 4.292e-01 1.440e-01 2.980 0.002879 **
## maritalunknown 1.590e+00 9.960e-01 1.597 0.110320
## educationbasic.6y -1.288e-01 2.238e-01 -0.576 0.564849
## educationbasic.9y -3.358e-01 1.783e-01 -1.883 0.059701 .
## educationhigh.school -7.043e-02 1.717e-01 -0.410 0.681597
## educationilliterate 1.404e+01 2.430e+02 0.058 0.953902
## educationprofessional.course -5.801e-02 1.935e-01 -0.300 0.764358
## educationuniversity.degree 1.543e-01 1.745e-01 0.884 0.376638
## educationunknown 6.021e-02 2.263e-01 0.266 0.790171
## defaultunknown -7.506e-01 1.213e-01 -6.187 6.14e-10 ***
## defaultyes -1.216e+01 5.354e+02 -0.023 0.981884
## housingunknown 7.847e-02 2.568e-01 0.306 0.759938
## housingyes -7.170e-02 7.661e-02 -0.936 0.349311
## loanunknown NA NA NA NA
## loanyes 6.327e-03 1.059e-01 0.060 0.952371
## contacttelephone -1.275e+00 1.031e-01 -12.359 < 2e-16 ***
## monthaug -1.205e+00 1.559e-01 -7.729 1.09e-14 ***
## monthdec 1.744e+00 5.951e-01 2.930 0.003392 **
## monthjul -1.358e+00 1.589e-01 -8.544 < 2e-16 ***
## monthjun -2.268e-01 1.678e-01 -1.352 0.176529
## monthmar 1.231e+00 2.515e-01 4.895 9.84e-07 ***
## monthmay -1.284e+00 1.498e-01 -8.573 < 2e-16 ***
## monthnov -1.125e+00 1.708e-01 -6.588 4.47e-11 ***
## monthoct 1.033e+00 2.416e-01 4.278 1.89e-05 ***
## monthsep 2.781e-01 2.556e-01 1.088 0.276481
## day_of_weekmon -1.459e-01 1.183e-01 -1.233 0.217546
## day_of_weekthu -8.816e-02 1.199e-01 -0.735 0.462071
## day_of_weektue 4.548e-02 1.204e-01 0.378 0.705524
## day_of_weekwed 8.585e-02 1.205e-01 0.712 0.476261
## duration 6.169e-03 1.870e-04 32.985 < 2e-16 ***
## campaign -7.285e-02 2.091e-02 -3.485 0.000493 ***
## pdays -1.481e-03 4.904e-04 -3.021 0.002520 **
## previous 2.273e-01 1.477e-01 1.539 0.123718
## poutcomenonexistent 9.715e-02 2.039e-01 0.476 0.633728
## poutcomesuccess 8.936e-01 4.812e-01 1.857 0.063332 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8212.4 on 5923 degrees of freedom
## Residual deviance: 4526.2 on 5876 degrees of freedom
## AIC: 4622.2
##
## Number of Fisher Scoring iterations: 12
# Feature Selection (backward)
model_lr2 <- step(object = model_lr, direction="backward", trace = 0)
summary(model_lr2)##
## Call:
## glm(formula = y ~ job + marital + education + default + contact +
## month + duration + campaign + pdays + previous, family = "binomial",
## data = bank_train_down)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4719 -0.5608 -0.0008 0.5548 2.6753
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.172e+00 3.131e-01 3.742 0.000182 ***
## jobblue-collar -4.828e-01 1.445e-01 -3.341 0.000834 ***
## jobentrepreneur -5.397e-01 2.208e-01 -2.444 0.014516 *
## jobhousemaid -1.488e-01 2.640e-01 -0.563 0.573185
## jobmanagement -2.023e-01 1.533e-01 -1.320 0.186935
## jobretired 1.047e+00 1.825e-01 5.734 9.82e-09 ***
## jobself-employed 1.855e-02 2.238e-01 0.083 0.933921
## jobservices -2.363e-01 1.546e-01 -1.529 0.126312
## jobstudent 6.753e-01 2.283e-01 2.958 0.003099 **
## jobtechnician -1.381e-01 1.312e-01 -1.053 0.292194
## jobunemployed 3.684e-02 2.448e-01 0.150 0.880403
## jobunknown 6.017e-01 4.580e-01 1.314 0.188977
## maritalmarried 1.846e-01 1.258e-01 1.467 0.142352
## maritalsingle 3.834e-01 1.358e-01 2.824 0.004747 **
## maritalunknown 1.523e+00 9.954e-01 1.530 0.125921
## educationbasic.6y -1.470e-01 2.222e-01 -0.662 0.508143
## educationbasic.9y -3.685e-01 1.758e-01 -2.096 0.036067 *
## educationhigh.school -1.031e-01 1.688e-01 -0.611 0.541411
## educationilliterate 1.402e+01 2.422e+02 0.058 0.953843
## educationprofessional.course -8.191e-02 1.914e-01 -0.428 0.668733
## educationuniversity.degree 1.220e-01 1.712e-01 0.713 0.475913
## educationunknown 5.832e-02 2.253e-01 0.259 0.795787
## defaultunknown -7.409e-01 1.203e-01 -6.157 7.40e-10 ***
## defaultyes -1.211e+01 5.354e+02 -0.023 0.981950
## contacttelephone -1.257e+00 1.017e-01 -12.364 < 2e-16 ***
## monthaug -1.183e+00 1.541e-01 -7.676 1.64e-14 ***
## monthdec 1.781e+00 5.918e-01 3.009 0.002620 **
## monthjul -1.340e+00 1.572e-01 -8.524 < 2e-16 ***
## monthjun -2.212e-01 1.662e-01 -1.331 0.183168
## monthmar 1.235e+00 2.508e-01 4.926 8.41e-07 ***
## monthmay -1.271e+00 1.484e-01 -8.565 < 2e-16 ***
## monthnov -1.111e+00 1.699e-01 -6.537 6.27e-11 ***
## monthoct 1.044e+00 2.404e-01 4.343 1.41e-05 ***
## monthsep 3.105e-01 2.543e-01 1.221 0.222084
## duration 6.170e-03 1.867e-04 33.040 < 2e-16 ***
## campaign -7.388e-02 2.092e-02 -3.532 0.000413 ***
## pdays -2.325e-03 1.983e-04 -11.727 < 2e-16 ***
## previous 1.541e-01 8.165e-02 1.887 0.059123 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8212.4 on 5923 degrees of freedom
## Residual deviance: 4537.0 on 5886 degrees of freedom
## AIC: 4613
##
## Number of Fisher Scoring iterations: 12
The summary of the current logistic regression model can be seen above. Coefficients that have p-value below 0.05 indicate that they are significant and should not be removed from the equation. And based on the result, it can be seen that the most significant variables in predicting customer’s investment decision is duration, contact and month.
# Predict Data Test
pred_lr <- predict(object = model_lr2, newdata = bank_test, type = "response")
pred_lr_label <- as.factor(ifelse(pred_lr > 0.5, "yes", "no"))
table(predict = pred_lr_label,
actual = bank_test$y)## actual
## predict no yes
## no 4968 137
## yes 872 613
K-Nearest Neighbor
Before building the model of K-Nearest Neighbor, we need to separate the target variable and the predictors.
# Data train predictors
bank_train_x <- bank_train_down %>%
select_if(is.numeric)
# Data train target
bank_train_y <- bank_train_down %>%
select(y)
# Data test predictors
bank_test_x <- bank_test %>%
select_if(is.numeric)
# Data test target
bank_test_y <- bank_test %>%
select(y)Next, we need to make sure that the dataset has the same scale. This is because KNN classify based on distance, hence the range of each variable must be the same. If there is a variable with a very large range that differs greatly compared to the other variables, then that variable will greatly affect the classification results and ignore other variables. Therefore it is necessary to do scaling.
# Data Summary
summary(bank)## age job marital education
## Min. :17.00 admin. :8314 divorced: 3675 university.degree :9736
## 1st Qu.:32.00 blue-collar:7441 married :19953 high.school :7596
## Median :38.00 technician :5400 single : 9257 basic.9y :4826
## Mean :40.01 services :3196 unknown : 65 professional.course:4192
## 3rd Qu.:47.00 management :2345 basic.4y :3322
## Max. :98.00 retired :1366 basic.6y :1865
## (Other) :4888 (Other) :1413
## default housing loan contact
## no :26007 no :14900 no :27131 cellular :20908
## unknown: 6940 unknown: 796 unknown: 796 telephone:12042
## yes : 3 yes :17254 yes : 5023
##
##
##
##
## month day_of_week duration campaign pdays
## may :11011 fri:6322 Min. : 0.0 Min. : 1.000 Min. : 0.0
## jul : 5763 mon:6812 1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.:999.0
## aug : 4948 thu:6857 Median : 180.0 Median : 2.000 Median :999.0
## jun : 4247 tue:6444 Mean : 258.1 Mean : 2.561 Mean :962.1
## nov : 3266 wed:6515 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.:999.0
## apr : 2085 Max. :4918.0 Max. :56.000 Max. :999.0
## (Other): 1630
## previous poutcome y
## Min. :0.0000 failure : 3429 no :29238
## 1st Qu.:0.0000 nonexistent:28416 yes: 3712
## Median :0.0000 success : 1105
## Mean :0.1747
## 3rd Qu.:0.0000
## Max. :7.0000
##
# Scaling
bank_train_x_scaled <- scale(bank_train_x)
bank_test_x_scaled <- scale(bank_test_x,
center = attr(bank_train_x_scaled,"scaled:center"),
scale = attr(bank_train_x_scaled, "scaled:scale"))Scaling process has been done. Next, the value of K will be determined. Selection of the value of K will be obtained from the root of the total observation (data train).
round(sqrt(nrow(bank_train_down)), 0)## [1] 77
K = 77
# Model Building
pred_knn <- knn(train = bank_train_x_scaled,
test = bank_test_x_scaled,
cl = bank_train_y$y,
k = 77)Model Evaluation
Logistic Regression
confusionMatrix(data = pred_lr_label, reference = bank_test$y, positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 4968 137
## yes 872 613
##
## Accuracy : 0.8469
## 95% CI : (0.838, 0.8555)
## No Information Rate : 0.8862
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4681
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.81733
## Specificity : 0.85068
## Pos Pred Value : 0.41279
## Neg Pred Value : 0.97316
## Prevalence : 0.11381
## Detection Rate : 0.09302
## Detection Prevalence : 0.22534
## Balanced Accuracy : 0.83401
##
## 'Positive' Class : yes
##
Based on the Confusion Matrix result, we can conclude that this model is quite good, with the Accuracy around 85%. However, Accuracy will be the most appropriate metric to choose when the data we have is balanced. And we know that the data set we used is NOT, hence other factor should be considered. In our case, because we are offering the customer the investment, and in our model we would like a prediction where positive = customer decide to invest (yes), so we tend to rely on Recall / Sensitivity (we would want to approach / offer as many customer as we can) which is around 82%.
K-Nearest Neighbor
confusionMatrix(data = pred_knn, reference = bank_test$y, positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 4939 210
## yes 901 540
##
## Accuracy : 0.8314
## 95% CI : (0.8222, 0.8404)
## No Information Rate : 0.8862
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4037
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.72000
## Specificity : 0.84572
## Pos Pred Value : 0.37474
## Neg Pred Value : 0.95922
## Prevalence : 0.11381
## Detection Rate : 0.08194
## Detection Prevalence : 0.21866
## Balanced Accuracy : 0.78286
##
## 'Positive' Class : yes
##
Based on the Confusion Matrix result, we can conclude that this model is quite good, with the Accuracy around 83%. However, Accuracy will be the most appropriate metric to choose when the data we have is balanced. And we know that the data set we used is NOT, hence other factor should be considered. In our case, because we are offering the customer the investment, and in our model we would like a prediction where positive = customer decide to invest (yes), so we tend to rely on Recall / Sensitivity (we would want to approach / offer as many customer as we can) which is around 72%.
Conclusion
After predicting with two models, the Logistic Regression model and K-Nearest Neighbor model, there’s no significant difference in term of Accuracy. The Logistic Regression model has 85% of Accuracy`, while the K-Nearest Neighbor model has 83%.
However, Accuracy will be the most appropriate metric to choose when the data we have is balanced. And we know that the data set we used is NOT, hence other factor should be considered. In this case we would want to approach / offer as many customer as we can. So, we tend to rely on Sensitivity. In this case, the Logistic Regression model is better, it has 82% of Sensitivity compared to K-Nearest Neighbor model that has 72%.
Another interesting insight from this report, based on the Logistic Regression model, it can be seen that the most significant variables in predicting customer’s investment decision is duration, contact and month.