Buying Personal Loans

Before you start reading my analysis below, let me first thank you for taking the time to read my work; it means a lot to me.

Everything you see in this analysis is the result of my study in the Classification I class at Algoritma Academy. To see what I’ve learned in more detail, you can visit the Algoritma Academy learning syllabus.

Everything I have written is entirely my personal opinion, based on my experience and knowledge so far. If something is wrong or missing, please feel free to contact me; I’d love to discuss it with you. Thank you.

About the dataset

This dataset is taken from the Kaggle website, from a post titled Bank_Loan_modelling by Sunil Jacob. It contains information about bank customers who bought personal loans.

Setup

We will use several libraries in this analysis: tidyverse to make data manipulation easier, caret and class to perform the machine learning modeling, and fastDummies to easily create dummy variables.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(fastDummies)
library(class)

Problem Understanding

This case is about a bank (Thera Bank) with a growing customer base. The majority of these customers are liability customers (depositors) with deposits of varying sizes. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and, in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise better-targeted campaigns to increase the success ratio with a minimal budget.

In this analysis, we will use classification models to predict the likelihood of a liability customer buying a personal loan.

# Import the data
loan <- read.csv("dataset/personal_loan.csv")
head(loan)
##   ID Age Experience Income ZIP.Code Family CCAvg Education Mortgage
## 1  1  25          1     49    91107      4   1.6         1        0
## 2  2  45         19     34    90089      3   1.5         1        0
## 3  3  39         15     11    94720      1   1.0         1        0
## 4  4  35          9    100    94112      1   2.7         2        0
## 5  5  35          8     45    91330      4   1.0         2        0
## 6  6  37         13     29    92121      4   0.4         2      155
##   Personal.Loan Securities.Account CD.Account Online CreditCard
## 1             0                  1          0      0          0
## 2             0                  1          0      0          0
## 3             0                  0          0      0          0
## 4             0                  0          0      0          0
## 5             0                  0          0      0          1
## 6             0                  0          0      1          0

Descriptions:

  • ID : Customer ID.
  • Age : Customer’s age in completed years.
  • Experience : Number of years of professional experience.
  • Income : Annual income of the customer.
  • ZIP.Code : Home Address ZIP code.
  • Family : Family size of the customer.
  • CCAvg : Avg. spending on credit cards per month.
  • Education : Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage : Value of house mortgage if any.
  • Personal.Loan : Did this customer accept the personal loan offered in the last campaign?
  • Securities.Account : Does the customer have a securities account with the bank?
  • CD.Account : Does the customer have a certificate of deposit (CD) account with the bank?
  • Online : Does the customer use internet banking facilities?
  • CreditCard : Does the customer use a credit card issued by Thera Bank?

Data Wrangling

Check if our dataset is clean and tidy enough to be used in our analysis.

Check Data Structure

str(loan)
## 'data.frame':    5000 obs. of  14 variables:
##  $ ID                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age               : int  25 45 39 35 35 37 53 50 35 34 ...
##  $ Experience        : int  1 19 15 9 8 13 27 24 10 9 ...
##  $ Income            : int  49 34 11 100 45 29 72 22 81 180 ...
##  $ ZIP.Code          : int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
##  $ Family            : int  4 3 1 1 4 4 2 1 3 1 ...
##  $ CCAvg             : num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
##  $ Education         : int  1 1 1 2 2 2 2 3 2 3 ...
##  $ Mortgage          : int  0 0 0 0 0 155 0 0 104 0 ...
##  $ Personal.Loan     : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Securities.Account: int  1 1 0 0 0 0 0 0 0 0 ...
##  $ CD.Account        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Online            : int  0 0 0 0 0 1 1 0 1 0 ...
##  $ CreditCard        : int  0 0 0 0 1 0 0 1 0 0 ...

Check Missing Values

anyNA(loan)
## [1] FALSE

Our dataset does not contain missing values; however, some columns have data types that do not fit their contents.

Data Cleansing

# Changing data types and removing ID & ZIP.Code columns
loan_clean <- loan %>% 
  mutate_at(vars(Education, Personal.Loan, Securities.Account, CD.Account, Online, CreditCard), as.factor) %>% 
  select(-c(ID, ZIP.Code))

Exploratory Data Analysis

We make an initial investigation of the dataset to find useful information and make it more suitable for modeling.

Check the Distribution of Data

summary(loan_clean)
##       Age          Experience       Income           Family     
##  Min.   :23.00   Min.   :-3.0   Min.   :  8.00   Min.   :1.000  
##  1st Qu.:35.00   1st Qu.:10.0   1st Qu.: 39.00   1st Qu.:1.000  
##  Median :45.00   Median :20.0   Median : 64.00   Median :2.000  
##  Mean   :45.34   Mean   :20.1   Mean   : 73.77   Mean   :2.396  
##  3rd Qu.:55.00   3rd Qu.:30.0   3rd Qu.: 98.00   3rd Qu.:3.000  
##  Max.   :67.00   Max.   :43.0   Max.   :224.00   Max.   :4.000  
##      CCAvg        Education    Mortgage     Personal.Loan Securities.Account
##  Min.   : 0.000   1:2096    Min.   :  0.0   0:4520        0:4478            
##  1st Qu.: 0.700   2:1403    1st Qu.:  0.0   1: 480        1: 522            
##  Median : 1.500   3:1501    Median :  0.0                                   
##  Mean   : 1.938             Mean   : 56.5                                   
##  3rd Qu.: 2.500             3rd Qu.:101.0                                   
##  Max.   :10.000             Max.   :635.0                                   
##  CD.Account Online   CreditCard
##  0:4698     0:2016   0:3530    
##  1: 302     1:2984   1:1470    
##                                
##                                
##                                
## 

Judging from the distribution of the data, Securities.Account and CD.Account are mostly 0, while Online and CreditCard tell us little about a person’s financial ability; therefore, we drop these four columns.

# Delete the columns
loan_clean <- loan_clean %>% 
  select(-c(Securities.Account, CD.Account, Online, CreditCard))

str(loan_clean)
## 'data.frame':    5000 obs. of  8 variables:
##  $ Age          : int  25 45 39 35 35 37 53 50 35 34 ...
##  $ Experience   : int  1 19 15 9 8 13 27 24 10 9 ...
##  $ Income       : int  49 34 11 100 45 29 72 22 81 180 ...
##  $ Family       : int  4 3 1 1 4 4 2 1 3 1 ...
##  $ CCAvg        : num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
##  $ Education    : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 3 2 3 ...
##  $ Mortgage     : int  0 0 0 0 0 155 0 0 104 0 ...
##  $ Personal.Loan: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...

Check the Proportion of the Target Class

prop.table(table(loan_clean$Personal.Loan))
## 
##     0     1 
## 0.904 0.096

We can see from the summary that the proportion of our target class is imbalanced: about 90% of the data belongs to class 0 and only 10% to class 1. Because the gap is so wide, we have to balance our data before modeling.

# Down sampling our data
set.seed(1996)
loan_balance <- downSample(x = loan_clean[, -ncol(loan_clean)],
                         y = loan_clean$Personal.Loan,
                         yname = "Personal.Loan")

prop.table(table(loan_balance$Personal.Loan))  
## 
##   0   1 
## 0.5 0.5

Cross Validation

Split our dataset into 80% training data and 20% test data.

set.seed(1996)
loan_split <- sample(nrow(loan_balance), nrow(loan_balance)*0.80)
loan_train <- loan_balance[loan_split, ]
loan_test <- loan_balance[-loan_split, ]

Recheck the proportions of our target class in both splits.

# Train data
loan_train$Personal.Loan %>% 
  table() %>% 
  prop.table()
## .
##         0         1 
## 0.5143229 0.4856771
# Test data
loan_test$Personal.Loan %>% 
  table() %>% 
  prop.table()
## .
##         0         1 
## 0.4427083 0.5572917

We have successfully split our data while keeping the class proportions relatively balanced.

Logistic Regression

Build Model

# Model with all column as predictors
model0 <- glm(formula = Personal.Loan ~ ., data = loan_train, family = "binomial")
summary(model0)
## 
## Call:
## glm(formula = Personal.Loan ~ ., family = "binomial", data = loan_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.91246  -0.36825  -0.06095   0.41926   3.02754  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.8591606  2.6041079  -3.402 0.000669 ***
## Age         -0.0172420  0.0994261  -0.173 0.862325    
## Experience   0.0239055  0.0989175   0.242 0.809035    
## Income       0.0493039  0.0040555  12.157  < 2e-16 ***
## Family       0.5274635  0.1076423   4.900 9.58e-07 ***
## CCAvg        0.2595508  0.0743680   3.490 0.000483 ***
## Education2   2.4069902  0.3284899   7.327 2.35e-13 ***
## Education3   2.4901757  0.3302423   7.540 4.68e-14 ***
## Mortgage     0.0007514  0.0009687   0.776 0.437923    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1064  on 767  degrees of freedom
## Residual deviance:  464  on 759  degrees of freedom
## AIC: 482
## 
## Number of Fisher Scoring iterations: 6

Model Tuning

As we can see, some variables have no significant impact in our model. We can make the model simpler by removing the insignificant variables with backward stepwise selection.

# Make new model with stepwise method
model1 <- step(model0, direction = "backward", trace = F)
summary(model1)
## 
## Call:
## glm(formula = Personal.Loan ~ Income + Family + CCAvg + Education, 
##     family = "binomial", data = loan_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.93531  -0.36823  -0.06004   0.42857   3.03898  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.12642    0.69943 -13.048  < 2e-16 ***
## Income       0.04973    0.00403  12.339  < 2e-16 ***
## Family       0.53235    0.10770   4.943 7.70e-07 ***
## CCAvg        0.25131    0.07360   3.415 0.000639 ***
## Education2   2.38602    0.32627   7.313 2.61e-13 ***
## Education3   2.45780    0.32202   7.632 2.30e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1064.04  on 767  degrees of freedom
## Residual deviance:  465.06  on 762  degrees of freedom
## AIC: 477.06
## 
## Number of Fisher Scoring iterations: 6

Our model has improved: it now includes only significant predictors and has a lower AIC (477.06, down from 482).
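To double-check the improvement, we can also compare the two models’ AIC values directly:

# Compare the AIC of the full model and the stepwise model
AIC(model0, model1) # model1 should show the lower value (about 477 vs. 482)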

Prediction and Evaluation

The test data that we created earlier will be used to test our model.

# Predict the probability of the positive class on the test data
loan_predict <- predict(model1, loan_test, type = "response")
# Convert probabilities to class labels using a 0.5 threshold
loan_label <- ifelse(loan_predict > 0.5, "1", "0")
loan_label <- as.factor(loan_label)

The predictions can then be compared against the actual values with a confusion matrix to evaluate our model.

# Evaluate the prediction using confusion matrix
loan_conf_logit <- confusionMatrix(loan_label, loan_test$Personal.Loan, positive = "1")
loan_conf_logit
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 78 13
##          1  7 94
##                                           
##                Accuracy : 0.8958          
##                  95% CI : (0.8437, 0.9352)
##     No Information Rate : 0.5573          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7904          
##                                           
##  Mcnemar's Test P-Value : 0.2636          
##                                           
##             Sensitivity : 0.8785          
##             Specificity : 0.9176          
##          Pos Pred Value : 0.9307          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.5573          
##          Detection Rate : 0.4896          
##    Detection Prevalence : 0.5260          
##       Balanced Accuracy : 0.8981          
##                                           
##        'Positive' Class : 1               
## 

Descriptions:

  • Accuracy : how accurately the model predicts the target class overall.
  • Sensitivity / Recall : how well the model identifies the positive class.
  • Specificity : how well the model identifies the negative class.
  • Pos Pred Value / Precision : how precise the model’s positive predictions are.
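These metrics follow directly from the confusion matrix counts above (TP = 94, FP = 7, FN = 13, TN = 78), as a quick hand computation shows:

# Recompute the metrics by hand from the confusion matrix counts
tp <- 94; fp <- 7; fn <- 13; tn <- 78

(tp + tn) / (tp + tn + fp + fn)  # Accuracy    : 0.8958
tp / (tp + fn)                   # Recall      : 0.8785
tn / (tn + fp)                   # Specificity : 0.9176
tp / (tp + fp)                   # Precision   : 0.9307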

Based on the results of the confusion matrix above, our model classifies 89.6% of the test data correctly overall, and it recovers 87.9% of the actual positive class (1).

K-Nearest Neighbour

Picking an Optimum K

A common rule of thumb for choosing K is the square root of N, where N is the total number of samples.

sqrt(loan_balance %>% nrow())
## [1] 30.98387

The square root is about 30.98, so nearby values such as 30, 31, or 32 are reasonable candidates. We use K = 31, since an odd K avoids tied votes in binary classification.

Scaling

head(loan_balance)
##   Age Experience Income Family CCAvg Education Mortgage Personal.Loan
## 1  55         29     53      1   1.4         1        0             0
## 2  33          8     20      3   1.3         1       83             0
## 3  33          6     78      4   2.0         2        0             0
## 4  64         40     52      2   1.1         1      226             0
## 5  33          7     21      1   0.6         3        0             0
## 6  48         22     42      3   0.6         2      121             0

The ranges of the variables differ widely, so feature scaling is necessary, using either min-max normalization or z-score standardization. Before scaling, the Education column has to be converted to a numeric form; because its levels are categories rather than magnitudes, we cannot simply cast it to numeric and instead create dummy variables first.

loan_train_dummy <- dummy_cols(loan_train, select_columns = "Education") %>% select(-Education)
loan_test_dummy <- dummy_cols(loan_test, select_columns = "Education") %>% select(-Education)

str(loan_train_dummy)
## 'data.frame':    768 obs. of  10 variables:
##  $ Age          : int  66 47 50 54 45 41 60 49 48 26 ...
##  $ Experience   : int  42 21 25 29 19 16 36 23 22 0 ...
##  $ Income       : int  53 141 38 119 22 113 165 140 71 30 ...
##  $ Family       : int  2 1 1 3 3 1 3 1 1 4 ...
##  $ CCAvg        : num  1.1 2.4 1.3 2 1.5 1 5.6 1.9 1.4 1.3 ...
##  $ Mortgage     : int  0 0 120 0 0 211 0 0 0 0 ...
##  $ Personal.Loan: Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 2 1 1 ...
##  $ Education_1  : int  1 1 0 1 1 0 1 0 0 0 ...
##  $ Education_2  : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Education_3  : int  0 0 0 0 0 1 0 1 1 1 ...
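For reference, min-max normalization (the alternative mentioned above) could be done with a small helper like the sketch below; the name normalize is our own. We use z-score standardization via scale() instead, because its training mean and standard deviation can be reused on the test set.

# A minimal min-max normalizer (illustrative only; not used below)
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}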
# Scaling the predictors on data train
loan_x_train <- loan_train_dummy %>% 
  select(-Personal.Loan) %>% 
  scale()

# Select only the target variable
loan_y_train <- loan_train_dummy %>% 
  select(Personal.Loan)

After scaling the train data, we have to use the mean and standard deviation from the train data to scale the test data; we cannot scale the two sets independently.

# Scaling the predictors on test data using mean and standard deviation from train data
loan_x_test <- loan_test_dummy %>% 
  select(-Personal.Loan) %>% 
  scale(center = attr(loan_x_train, "scaled:center") , 
        scale = attr(loan_x_train, "scaled:scale")) # center : mean ; scale : standard deviation

# Select only the target variable
loan_y_test <- loan_test_dummy %>% 
  select(Personal.Loan)

Prediction and Evaluation

Now we can use the K-NN method to predict our test data.

# Use KNN to predict test data
loan_knn <- knn(train = loan_x_train, 
                test = loan_x_test,
                cl = loan_y_train$Personal.Loan,
                k = 31)
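The choice K = 31 follows the rule of thumb above. As a quick sanity check, a sketch like the one below could compare neighboring odd values; strictly speaking, K should be tuned on a separate validation split rather than the test set.

# Compare a few candidate K values around sqrt(N)
for (k in c(29, 31, 33)) {
  pred_k <- knn(train = loan_x_train,
                test = loan_x_test,
                cl = loan_y_train$Personal.Loan,
                k = k)
  cat("k =", k, "accuracy =", round(mean(pred_k == loan_y_test$Personal.Loan), 4), "\n")
}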

As with Logistic Regression, we can use a confusion matrix to evaluate the K-NN predictions.

# Use confusion matrix to evaluate
loan_conf_knn <- confusionMatrix(loan_knn, loan_y_test$Personal.Loan,  positive = "1")
loan_conf_knn
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 83 18
##          1  2 89
##                                           
##                Accuracy : 0.8958          
##                  95% CI : (0.8437, 0.9352)
##     No Information Rate : 0.5573          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7929          
##                                           
##  Mcnemar's Test P-Value : 0.0007962       
##                                           
##             Sensitivity : 0.8318          
##             Specificity : 0.9765          
##          Pos Pred Value : 0.9780          
##          Neg Pred Value : 0.8218          
##              Prevalence : 0.5573          
##          Detection Rate : 0.4635          
##    Detection Prevalence : 0.4740          
##       Balanced Accuracy : 0.9041          
##                                           
##        'Positive' Class : 1               
## 

Based on the results of the confusion matrix above, our K-NN model has the same accuracy (89.6%) as the logistic regression model, though its recall on the positive class (1) is lower at 83.2%.

To see the difference between these methods more clearly we can use a table to compare the scores.

# Evaluation of Logistic Regression Method
tibble(Accuracy = loan_conf_logit$overall[1],
       Recall = loan_conf_logit$byClass[1],
       Specificity = loan_conf_logit$byClass[2],
       Precision = loan_conf_logit$byClass[3])
## # A tibble: 1 x 4
##   Accuracy Recall Specificity Precision
##      <dbl>  <dbl>       <dbl>     <dbl>
## 1    0.896  0.879       0.918     0.931
# Evaluation of KNN Method
tibble(Accuracy = loan_conf_knn$overall[1],
       Recall = loan_conf_knn$byClass[1],
       Specificity = loan_conf_knn$byClass[2],
       Precision = loan_conf_knn$byClass[3])
## # A tibble: 1 x 4
##   Accuracy Recall Specificity Precision
##      <dbl>  <dbl>       <dbl>     <dbl>
## 1    0.896  0.832       0.976     0.978
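The two tibbles can also be stacked into a single table for a direct side-by-side view, for example:

# Combine both evaluations into one comparison table
bind_rows(
  tibble(Model = "Logistic Regression",
         Accuracy = loan_conf_logit$overall["Accuracy"],
         Recall = loan_conf_logit$byClass["Sensitivity"],
         Specificity = loan_conf_logit$byClass["Specificity"],
         Precision = loan_conf_logit$byClass["Pos Pred Value"]),
  tibble(Model = "K-NN",
         Accuracy = loan_conf_knn$overall["Accuracy"],
         Recall = loan_conf_knn$byClass["Sensitivity"],
         Specificity = loan_conf_knn$byClass["Specificity"],
         Precision = loan_conf_knn$byClass["Pos Pred Value"])
)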

Comparing the two methods, Logistic Regression is better at correctly identifying the actual buyers of personal loans: its recall (87.9%) is higher than that of K-NN (83.2%).

Conclusion

By using Logistic Regression and K-NN as classification methods, we can see that both perform fairly well at predicting whether a customer will buy a personal loan. If I were the bank, I would want to reach as many potential buyers as possible, so I would prioritize recall over precision and choose Logistic Regression as the method.