library(GGally)
library(dplyr)
library(caret)
library(readr)
library(tidyverse)
library(class)
library(partykit)

Exploratory Data Analysis

loan <- read.csv("data/Bank_Personal_Loan_Modelling.csv")
head(loan)

Column descriptions

*ID : Customer ID

*Age : Customer’s age in completed years

*Experience : Years of professional experience

*Income Annual : income of the customer ($000)

*ZIPCode : Home Address ZIP code.

*Family : Family size of the customer

*CCAvg : Avg. spending on credit cards per month ($000)

*Education : Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional

*Mortgage : Value of house mortgage if any. ($000)

*Personal Loan : Did this customer accept the personal loan offered in the last campaign?

*Securities Account : Does the customer have a securities account with the bank?

*CD : Account Does the customer have a certificate of deposit (CD) account with the bank?

*Online : Does the customer use internet banking facilities?

*CreditCard : Does the customer uses a credit card issued by Universal Bank?

loans <- loan %>% 
  mutate(Personal.Loan = as.factor(Personal.Loan), Securities.Account = as.factor(Securities.Account),
         CD.Account = as.factor(CD.Account), Online = as.factor(Online), Family = as.factor(Family),
         CreditCard = as.factor(CreditCard), Education = as.factor(Education)) %>% 
  select(- c(ID, ZIP.Code))

head(loans)

Checking Multicollinearity

ggcorr(loans, label = T)

Age and experience have strong correlation. The graphic shows very high multicollinearity (1)

plot(loan$Age, loan$Experience, xlab = "Age", ylab = "Experience")

Feature Engineering

Because the correlation between Age and Experience is very strong, so I use feature engineering to make a new column from those two variables. Adding column performance that comes from Experience/Age

loans_ed <- loans %>% 
  mutate(performance = Experience/Age) %>% 
  select(- c(Age, Experience))

Re-checking Multicollinearity

ggcorr(loans_ed, label = T)

Identifying customers who purchase the Personal Loan based on dataset

loans_1 <- loans %>% 
  filter(Personal.Loan == 1) 

summary(loans_1)

#>       Age          Experience        Income      Family      CCAvg       
#>  Min.   :26.00   Min.   : 0.00   Min.   : 60.0   1:107   Min.   : 0.000  
#>  1st Qu.:35.00   1st Qu.: 9.00   1st Qu.:122.0   2:106   1st Qu.: 2.600  
#>  Median :45.00   Median :20.00   Median :142.5   3:133   Median : 3.800  
#>  Mean   :45.07   Mean   :19.84   Mean   :144.7   4:134   Mean   : 3.905  
#>  3rd Qu.:55.00   3rd Qu.:30.00   3rd Qu.:172.0           3rd Qu.: 5.348  
#>  Max.   :65.00   Max.   :41.00   Max.   :203.0           Max.   :10.000  
#>  Education    Mortgage     Personal.Loan Securities.Account CD.Account
#>  1: 93     Min.   :  0.0   0:  0         0:420              0:340     
#>  2:182     1st Qu.:  0.0   1:480         1: 60              1:140     
#>  3:205     Median :  0.0                                              
#>            Mean   :100.8                                              
#>            3rd Qu.:192.5                                              
#>            Max.   :617.0                                              
#>  Online  CreditCard
#>  0:189   0:337     
#>  1:291   1:143     
#>                    
#>                    
#>                    
#>

Customer’s income distribution that purchase the personal loan.

hist(loans_1$Income, breaks = 100, xlab = "Income")

Checking the proportion

prop.table(table(loans_ed$Personal.Loan))

#> 
#>     0     1 
#> 0.904 0.096

Cross Validation

set.seed(90)

idx <- sample(nrow(loans_ed), nrow(loans_ed)*0.8)

loans_train <- loans_ed[idx,]
loans_test <- loans_ed[-idx,]

logistic.train_label <- loans_ed[idx, "Personal.Loan"]
logistic.test_label <- loans_ed[-idx, "Personal.Loan"]

Checking the proportion for data train and data test

prop.table(table(loans_train$Personal.Loan))

#> 
#>       0       1 
#> 0.90525 0.09475

prop.table(table(loans_test$Personal.Loan))

#> 
#>     0     1 
#> 0.899 0.101

Modelling

Logistic Regression Model

loans_none <- glm(formula = Personal.Loan ~ 1, data = loans_train, family = "binomial")
loans_all <- glm(formula = Personal.Loan ~ ., data = loans_train, family = "binomial")

step(object = loans_none, list(lower = loans_none, upper = loans_all), direction = "forward")

#> Start:  AIC=2509.14
#> Personal.Loan ~ 1
#> 
#>                      Df Deviance    AIC
#> + Income              1   1591.8 1595.8
#> + CCAvg               1   2100.9 2104.9
#> + CD.Account          1   2297.4 2301.4
#> + Education           2   2423.4 2429.4
#> + Mortgage            1   2430.5 2434.5
#> + Family              3   2483.8 2491.8
#> <none>                    2507.1 2509.1
#> + Securities.Account  1   2506.4 2510.4
#> + Online              1   2506.8 2510.8
#> + performance         1   2507.1 2511.1
#> + CreditCard          1   2507.1 2511.1
#> 
#> Step:  AIC=1595.81
#> Personal.Loan ~ Income
#> 
#>                      Df Deviance    AIC
#> + Education           2   1143.3 1151.3
#> + Family              3   1367.3 1377.3
#> + CD.Account          1   1468.3 1474.3
#> + CCAvg               1   1586.8 1592.8
#> + Mortgage            1   1587.0 1593.0
#> + Securities.Account  1   1589.3 1595.3
#> <none>                    1591.8 1595.8
#> + CreditCard          1   1591.6 1597.6
#> + performance         1   1591.8 1597.8
#> + Online              1   1591.8 1597.8
#> 
#> Step:  AIC=1151.32
#> Personal.Loan ~ Income + Education
#> 
#>                      Df Deviance    AIC
#> + Family              3   1040.9 1054.9
#> + CD.Account          1   1059.0 1069.0
#> + CCAvg               1   1130.0 1140.0
#> + Mortgage            1   1137.8 1147.8
#> <none>                    1143.3 1151.3
#> + Securities.Account  1   1141.9 1151.9
#> + performance         1   1142.6 1152.6
#> + CreditCard          1   1143.2 1153.2
#> + Online              1   1143.3 1153.3
#> 
#> Step:  AIC=1054.89
#> Personal.Loan ~ Income + Education + Family
#> 
#>                      Df Deviance    AIC
#> + CD.Account          1    966.2  982.2
#> + CCAvg               1   1021.8 1037.8
#> + Mortgage            1   1036.8 1052.8
#> + Securities.Account  1   1038.8 1054.8
#> <none>                    1040.9 1054.9
#> + performance         1   1039.8 1055.8
#> + Online              1   1040.8 1056.8
#> + CreditCard          1   1040.8 1056.8
#> 
#> Step:  AIC=982.2
#> Personal.Loan ~ Income + Education + Family + CD.Account
#> 
#>                      Df Deviance    AIC
#> + CCAvg               1   951.50 969.50
#> + CreditCard          1   956.54 974.54
#> + Online              1   960.35 978.35
#> + Securities.Account  1   960.36 978.36
#> + Mortgage            1   963.48 981.48
#> <none>                    966.20 982.20
#> + performance         1   965.77 983.77
#> 
#> Step:  AIC=969.5
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg
#> 
#>                      Df Deviance    AIC
#> + CreditCard          1   942.55 962.55
#> + Securities.Account  1   944.97 964.97
#> + Online              1   945.30 965.30
#> + Mortgage            1   947.53 967.53
#> <none>                    951.50 969.50
#> + performance         1   950.46 970.46
#> 
#> Step:  AIC=962.55
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg + 
#>     CreditCard
#> 
#>                      Df Deviance    AIC
#> + Securities.Account  1   933.57 955.57
#> + Online              1   933.90 955.90
#> + Mortgage            1   938.97 960.97
#> <none>                    942.55 962.55
#> + performance         1   941.48 963.48
#> 
#> Step:  AIC=955.57
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg + 
#>     CreditCard + Securities.Account
#> 
#>               Df Deviance    AIC
#> + Online       1   923.33 947.33
#> + Mortgage     1   930.14 954.14
#> <none>             933.57 955.57
#> + performance  1   932.60 956.60
#> 
#> Step:  AIC=947.33
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg + 
#>     CreditCard + Securities.Account + Online
#> 
#>               Df Deviance    AIC
#> + Mortgage     1   920.21 946.21
#> <none>             923.33 947.33
#> + performance  1   922.12 948.12
#> 
#> Step:  AIC=946.21
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg + 
#>     CreditCard + Securities.Account + Online + Mortgage
#> 
#>               Df Deviance    AIC
#> <none>             920.21 946.21
#> + performance  1   919.00 947.00

#> 
#> Call:  glm(formula = Personal.Loan ~ Income + Education + Family + CD.Account + 
#>     CCAvg + CreditCard + Securities.Account + Online + Mortgage, 
#>     family = "binomial", data = loans_train)
#> 
#> Coefficients:
#>         (Intercept)               Income           Education2  
#>          -12.663684             0.063204             3.795500  
#>          Education3              Family2              Family3  
#>            3.968236            -0.416268             1.816834  
#>             Family4          CD.Account1                CCAvg  
#>            1.572921             3.463408             0.204684  
#>         CreditCard1  Securities.Account1              Online1  
#>           -0.840227            -1.051212            -0.588341  
#>            Mortgage  
#>            0.001168  
#> 
#> Degrees of Freedom: 3999 Total (i.e. Null);  3987 Residual
#> Null Deviance:       2507 
#> Residual Deviance: 920.2     AIC: 946.2

model_logistic <- glm(formula = Personal.Loan ~ Income + Education + CD.Account + 
    Family + CreditCard + Online + Securities.Account + CCAvg, 
    family = "binomial", data = loans_train)

model_logistic %>% 
  summary()

#> 
#> Call:
#> glm(formula = Personal.Loan ~ Income + Education + CD.Account + 
#>     Family + CreditCard + Online + Securities.Account + CCAvg, 
#>     family = "binomial", data = loans_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.1358  -0.1922  -0.0652  -0.0191   4.1434  
#> 
#> Coefficients:
#>                      Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)         -12.60226    0.61275 -20.567  < 2e-16 ***
#> Income                0.06382    0.00344  18.552  < 2e-16 ***
#> Education2            3.76292    0.29882  12.593  < 2e-16 ***
#> Education3            3.92711    0.29842  13.160  < 2e-16 ***
#> CD.Account1           3.50698    0.38136   9.196  < 2e-16 ***
#> Family2              -0.39377    0.25587  -1.539 0.123810    
#> Family3               1.83818    0.27320   6.728 1.72e-11 ***
#> Family4               1.59724    0.26070   6.127 8.96e-10 ***
#> CreditCard1          -0.85130    0.23317  -3.651 0.000261 ***
#> Online1              -0.59567    0.18702  -3.185 0.001447 ** 
#> Securities.Account1  -1.05852    0.34607  -3.059 0.002223 ** 
#> CCAvg                 0.19439    0.05042   3.855 0.000116 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 2507.14  on 3999  degrees of freedom
#> Residual deviance:  923.33  on 3988  degrees of freedom
#> AIC: 947.33
#> 
#> Number of Fisher Scoring iterations: 8

In this model, I set the threshold 0.5, so if the probability is higher than 0.5, that means the customer apply for the personal loan

pred <- predict(model_logistic, loans_test, type = "response")
pred_label <-  as.factor(if_else(pred < 0.5, 0, 1))

table(pred_label)

#> pred_label
#>   0   1 
#> 910  90

hist(predict(model_logistic, loans_test, type = "response"), breaks = 100,
     xlab = "Logistic Model Probability")

k-NN Model

train_label <- loans_train$Personal.Loan
test_label <- loans_test$Personal.Loan

Scaling data train and data test

train_knn <- loans_train %>% 
  select_if(is.numeric) %>% 
  scale

test_knn <- loans_test %>% 
  select_if(is.numeric) %>% 
  scale(center = attr(train_knn, "scaled:center"),
        scale = attr(train_knn, "scaled:scale"))

Determine k value

sqrt(nrow(train_knn))

#> [1] 63.24555

model_knn <- knn(train = train_knn, test = test_knn, cl = train_label, k = 63)

Evaluate

Logistic Confusion Matrix

confusionMatrix(pred_label, loans_test$Personal.Loan, positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 883  27
#>          1  16  74
#>                                           
#>                Accuracy : 0.957           
#>                  95% CI : (0.9425, 0.9687)
#>     No Information Rate : 0.899           
#>     P-Value [Acc > NIR] : 9.374e-12       
#>                                           
#>                   Kappa : 0.7512          
#>                                           
#>  Mcnemar's Test P-Value : 0.1273          
#>                                           
#>             Sensitivity : 0.7327          
#>             Specificity : 0.9822          
#>          Pos Pred Value : 0.8222          
#>          Neg Pred Value : 0.9703          
#>              Prevalence : 0.1010          
#>          Detection Rate : 0.0740          
#>    Detection Prevalence : 0.0900          
#>       Balanced Accuracy : 0.8574          
#>                                           
#>        'Positive' Class : 1               
#>

k-NN Confusion Matrix

confusionMatrix(model_knn, test_label, positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 887  91
#>          1  12  10
#>                                           
#>                Accuracy : 0.897           
#>                  95% CI : (0.8765, 0.9151)
#>     No Information Rate : 0.899           
#>     P-Value [Acc > NIR] : 0.6085          
#>                                           
#>                   Kappa : 0.1312          
#>                                           
#>  Mcnemar's Test P-Value : 1.523e-14       
#>                                           
#>             Sensitivity : 0.09901         
#>             Specificity : 0.98665         
#>          Pos Pred Value : 0.45455         
#>          Neg Pred Value : 0.90695         
#>              Prevalence : 0.10100         
#>          Detection Rate : 0.01000         
#>    Detection Prevalence : 0.02200         
#>       Balanced Accuracy : 0.54283         
#>                                           
#>        'Positive' Class : 1               
#>

Checking ROC

library(ROCR)
pred_prob <- predict(object = model_logistic, newdata = loans_test, type = "response")
pred.logistic <- prediction(pred_prob, labels = logistic.test_label)
perf <- performance(prediction.obj = pred.logistic, measure = "tpr", "fpr")
plot(perf)

Checking AUC value

auc <- performance(pred.logistic ,measure = "auc")
auc@y.values[[1]]

#> [1] 0.9755724

AUC value is good because the value is 0.97, close to 1.

Conclusion

From logistic regression and k-NN model, I got the accuracy, recall, and precision for both models. Because the Bank want to increase the Personal.Loan’s customer but with minimal budget, so I focus on precision. In the confusion matrix. As the result, the logistic regression predicts better than k-NN model. Logistic regression has 95.7% accuracy and 82.2% precision while k-NN has 89.7% accuracy and 45.4% precision. I choose higher precision to evaluate the prediction of Personal.Loan. It is so unfortunate if I predict someone will purchase loan but in real case they don’t. Another useful insight is when I predict someone don’t purchase the loan but in real case they purchase it. From that it can reduce marketing cost.

Bank Loan Modelling

Toho Dustin

November 14, 2019