Exploratory Data Analysis

Column descriptions

*ID : Customer ID

*Age : Customer’s age in completed years

*Experience : Years of professional experience

*Income Annual : income of the customer ($000)

*ZIPCode : Home Address ZIP code.

*Family : Family size of the customer

*CCAvg : Avg. spending on credit cards per month ($000)

*Education : Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional

*Mortgage : Value of house mortgage if any. ($000)

*Personal Loan : Did this customer accept the personal loan offered in the last campaign?

*Securities Account : Does the customer have a securities account with the bank?

*CD : Account Does the customer have a certificate of deposit (CD) account with the bank?

*Online : Does the customer use internet banking facilities?

*CreditCard : Does the customer uses a credit card issued by Universal Bank?

Checking Multicollinearity

Age and experience have strong correlation. The graphic shows very high multicollinearity (1)

Feature Engineering

Because the correlation between Age and Experience is very strong, so I use feature engineering to make a new column from those two variables. Adding column performance that comes from Experience/Age

Re-checking Multicollinearity

Identifying customers who purchase the Personal Loan based on dataset

#>       Age          Experience        Income      Family      CCAvg       
#>  Min.   :26.00   Min.   : 0.00   Min.   : 60.0   1:107   Min.   : 0.000  
#>  1st Qu.:35.00   1st Qu.: 9.00   1st Qu.:122.0   2:106   1st Qu.: 2.600  
#>  Median :45.00   Median :20.00   Median :142.5   3:133   Median : 3.800  
#>  Mean   :45.07   Mean   :19.84   Mean   :144.7   4:134   Mean   : 3.905  
#>  3rd Qu.:55.00   3rd Qu.:30.00   3rd Qu.:172.0           3rd Qu.: 5.348  
#>  Max.   :65.00   Max.   :41.00   Max.   :203.0           Max.   :10.000  
#>  Education    Mortgage     Personal.Loan Securities.Account CD.Account
#>  1: 93     Min.   :  0.0   0:  0         0:420              0:340     
#>  2:182     1st Qu.:  0.0   1:480         1: 60              1:140     
#>  3:205     Median :  0.0                                              
#>            Mean   :100.8                                              
#>            3rd Qu.:192.5                                              
#>            Max.   :617.0                                              
#>  Online  CreditCard
#>  0:189   0:337     
#>  1:291   1:143     
#>                    
#>                    
#>                    
#> 

Customer’s income distribution that purchase the personal loan.

Checking the proportion

#> 
#>     0     1 
#> 0.904 0.096

Modelling

Logistic Regression Model

#> Start:  AIC=2509.14
#> Personal.Loan ~ 1
#> 
#>                      Df Deviance    AIC
#> + Income              1   1591.8 1595.8
#> + CCAvg               1   2100.9 2104.9
#> + CD.Account          1   2297.4 2301.4
#> + Education           2   2423.4 2429.4
#> + Mortgage            1   2430.5 2434.5
#> + Family              3   2483.8 2491.8
#> <none>                    2507.1 2509.1
#> + Securities.Account  1   2506.4 2510.4
#> + Online              1   2506.8 2510.8
#> + performance         1   2507.1 2511.1
#> + CreditCard          1   2507.1 2511.1
#> 
#> Step:  AIC=1595.81
#> Personal.Loan ~ Income
#> 
#>                      Df Deviance    AIC
#> + Education           2   1143.3 1151.3
#> + Family              3   1367.3 1377.3
#> + CD.Account          1   1468.3 1474.3
#> + CCAvg               1   1586.8 1592.8
#> + Mortgage            1   1587.0 1593.0
#> + Securities.Account  1   1589.3 1595.3
#> <none>                    1591.8 1595.8
#> + CreditCard          1   1591.6 1597.6
#> + performance         1   1591.8 1597.8
#> + Online              1   1591.8 1597.8
#> 
#> Step:  AIC=1151.32
#> Personal.Loan ~ Income + Education
#> 
#>                      Df Deviance    AIC
#> + Family              3   1040.9 1054.9
#> + CD.Account          1   1059.0 1069.0
#> + CCAvg               1   1130.0 1140.0
#> + Mortgage            1   1137.8 1147.8
#> <none>                    1143.3 1151.3
#> + Securities.Account  1   1141.9 1151.9
#> + performance         1   1142.6 1152.6
#> + CreditCard          1   1143.2 1153.2
#> + Online              1   1143.3 1153.3
#> 
#> Step:  AIC=1054.89
#> Personal.Loan ~ Income + Education + Family
#> 
#>                      Df Deviance    AIC
#> + CD.Account          1    966.2  982.2
#> + CCAvg               1   1021.8 1037.8
#> + Mortgage            1   1036.8 1052.8
#> + Securities.Account  1   1038.8 1054.8
#> <none>                    1040.9 1054.9
#> + performance         1   1039.8 1055.8
#> + Online              1   1040.8 1056.8
#> + CreditCard          1   1040.8 1056.8
#> 
#> Step:  AIC=982.2
#> Personal.Loan ~ Income + Education + Family + CD.Account
#> 
#>                      Df Deviance    AIC
#> + CCAvg               1   951.50 969.50
#> + CreditCard          1   956.54 974.54
#> + Online              1   960.35 978.35
#> + Securities.Account  1   960.36 978.36
#> + Mortgage            1   963.48 981.48
#> <none>                    966.20 982.20
#> + performance         1   965.77 983.77
#> 
#> Step:  AIC=969.5
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg
#> 
#>                      Df Deviance    AIC
#> + CreditCard          1   942.55 962.55
#> + Securities.Account  1   944.97 964.97
#> + Online              1   945.30 965.30
#> + Mortgage            1   947.53 967.53
#> <none>                    951.50 969.50
#> + performance         1   950.46 970.46
#> 
#> Step:  AIC=962.55
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg + 
#>     CreditCard
#> 
#>                      Df Deviance    AIC
#> + Securities.Account  1   933.57 955.57
#> + Online              1   933.90 955.90
#> + Mortgage            1   938.97 960.97
#> <none>                    942.55 962.55
#> + performance         1   941.48 963.48
#> 
#> Step:  AIC=955.57
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg + 
#>     CreditCard + Securities.Account
#> 
#>               Df Deviance    AIC
#> + Online       1   923.33 947.33
#> + Mortgage     1   930.14 954.14
#> <none>             933.57 955.57
#> + performance  1   932.60 956.60
#> 
#> Step:  AIC=947.33
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg + 
#>     CreditCard + Securities.Account + Online
#> 
#>               Df Deviance    AIC
#> + Mortgage     1   920.21 946.21
#> <none>             923.33 947.33
#> + performance  1   922.12 948.12
#> 
#> Step:  AIC=946.21
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg + 
#>     CreditCard + Securities.Account + Online + Mortgage
#> 
#>               Df Deviance    AIC
#> <none>             920.21 946.21
#> + performance  1   919.00 947.00
#> 
#> Call:  glm(formula = Personal.Loan ~ Income + Education + Family + CD.Account + 
#>     CCAvg + CreditCard + Securities.Account + Online + Mortgage, 
#>     family = "binomial", data = loans_train)
#> 
#> Coefficients:
#>         (Intercept)               Income           Education2  
#>          -12.663684             0.063204             3.795500  
#>          Education3              Family2              Family3  
#>            3.968236            -0.416268             1.816834  
#>             Family4          CD.Account1                CCAvg  
#>            1.572921             3.463408             0.204684  
#>         CreditCard1  Securities.Account1              Online1  
#>           -0.840227            -1.051212            -0.588341  
#>            Mortgage  
#>            0.001168  
#> 
#> Degrees of Freedom: 3999 Total (i.e. Null);  3987 Residual
#> Null Deviance:       2507 
#> Residual Deviance: 920.2     AIC: 946.2
#> 
#> Call:
#> glm(formula = Personal.Loan ~ Income + Education + CD.Account + 
#>     Family + CreditCard + Online + Securities.Account + CCAvg, 
#>     family = "binomial", data = loans_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.1358  -0.1922  -0.0652  -0.0191   4.1434  
#> 
#> Coefficients:
#>                      Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)         -12.60226    0.61275 -20.567  < 2e-16 ***
#> Income                0.06382    0.00344  18.552  < 2e-16 ***
#> Education2            3.76292    0.29882  12.593  < 2e-16 ***
#> Education3            3.92711    0.29842  13.160  < 2e-16 ***
#> CD.Account1           3.50698    0.38136   9.196  < 2e-16 ***
#> Family2              -0.39377    0.25587  -1.539 0.123810    
#> Family3               1.83818    0.27320   6.728 1.72e-11 ***
#> Family4               1.59724    0.26070   6.127 8.96e-10 ***
#> CreditCard1          -0.85130    0.23317  -3.651 0.000261 ***
#> Online1              -0.59567    0.18702  -3.185 0.001447 ** 
#> Securities.Account1  -1.05852    0.34607  -3.059 0.002223 ** 
#> CCAvg                 0.19439    0.05042   3.855 0.000116 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 2507.14  on 3999  degrees of freedom
#> Residual deviance:  923.33  on 3988  degrees of freedom
#> AIC: 947.33
#> 
#> Number of Fisher Scoring iterations: 8

In this model, I set the threshold 0.5, so if the probability is higher than 0.5, that means the customer apply for the personal loan

#> pred_label
#>   0   1 
#> 910  90

Evaluate

Logistic Confusion Matrix

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 883  27
#>          1  16  74
#>                                           
#>                Accuracy : 0.957           
#>                  95% CI : (0.9425, 0.9687)
#>     No Information Rate : 0.899           
#>     P-Value [Acc > NIR] : 9.374e-12       
#>                                           
#>                   Kappa : 0.7512          
#>                                           
#>  Mcnemar's Test P-Value : 0.1273          
#>                                           
#>             Sensitivity : 0.7327          
#>             Specificity : 0.9822          
#>          Pos Pred Value : 0.8222          
#>          Neg Pred Value : 0.9703          
#>              Prevalence : 0.1010          
#>          Detection Rate : 0.0740          
#>    Detection Prevalence : 0.0900          
#>       Balanced Accuracy : 0.8574          
#>                                           
#>        'Positive' Class : 1               
#> 

k-NN Confusion Matrix

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 887  91
#>          1  12  10
#>                                           
#>                Accuracy : 0.897           
#>                  95% CI : (0.8765, 0.9151)
#>     No Information Rate : 0.899           
#>     P-Value [Acc > NIR] : 0.6085          
#>                                           
#>                   Kappa : 0.1312          
#>                                           
#>  Mcnemar's Test P-Value : 1.523e-14       
#>                                           
#>             Sensitivity : 0.09901         
#>             Specificity : 0.98665         
#>          Pos Pred Value : 0.45455         
#>          Neg Pred Value : 0.90695         
#>              Prevalence : 0.10100         
#>          Detection Rate : 0.01000         
#>    Detection Prevalence : 0.02200         
#>       Balanced Accuracy : 0.54283         
#>                                           
#>        'Positive' Class : 1               
#> 

Checking AUC value

#> [1] 0.9755724

AUC value is good because the value is 0.97, close to 1.

Conclusion

From logistic regression and k-NN model, I got the accuracy, recall, and precision for both models. Because the Bank want to increase the Personal.Loan’s customer but with minimal budget, so I focus on precision. In the confusion matrix. As the result, the logistic regression predicts better than k-NN model. Logistic regression has 95.7% accuracy and 82.2% precision while k-NN has 89.7% accuracy and 45.4% precision. I choose higher precision to evaluate the prediction of Personal.Loan. It is so unfortunate if I predict someone will purchase loan but in real case they don’t. Another useful insight is when I predict someone don’t purchase the loan but in real case they purchase it. From that it can reduce marketing cost.