Loan Prediction Using Logistic Regression and K-Nearest Neighbor

Introduction

In this case study, I examine loan prediction data, look for indicators that can predict approval or rejection (Loan Status), and aim to automate the loan eligibility process in real time based on the details customers provide when filling out the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others.

The banking industry is one of the heaviest users of analytical methods and data science. This data set gives me a feel for working with bank data, including the challenges banks face and the strategies they employ. This is a classification problem. The data consists of 500 rows and 15 columns, which are used to predict whether or not a loan will be approved.

I will make predictions with two supervised learning algorithms: Logistic Regression and K-Nearest Neighbor.

Data Preparation

Import data:

loan <- read.csv("data_input/df1_loan.csv", stringsAsFactors = TRUE)
loan

I’ll remove the $ symbol from the total income column and convert it to numeric.

# drop the leading "$", then convert the remaining text to a number
loan$Total_Income <- sub("$", "", as.character(loan$Total_Income), fixed = TRUE)
loan$Total_Income <- as.numeric(loan$Total_Income)
loan
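
As a small illustration of the cleaning step above, here is a sketch on a few made-up strings. The raw format ("$" followed by a decimal number) is an assumption about how the column looks in the CSV, and `raw_income` is a hypothetical vector, not part of the data set.

```r
# Hypothetical raw values, assumed to look like the Total_Income column
raw_income <- c("$5849.0", "$6091.0", "$3000.0")

# Remove the literal "$" and convert the rest to numeric
clean_income <- as.numeric(sub("$", "", raw_income, fixed = TRUE))
clean_income

# If values also contained thousands separators, gsub() could strip both
with_commas <- as.numeric(gsub("[$,]", "", "$12,841.0"))
with_commas
```

Using `fixed = TRUE` treats "$" as a plain character rather than as the regular-expression end-of-string anchor.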

Data Cleansing

Check data types:

str(loan)

#> 'data.frame':    500 obs. of  15 variables:
#>  $ X                : int  0 1 2 3 4 5 6 7 8 9 ...
#>  $ Loan_ID          : Factor w/ 500 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
#>  $ Gender           : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#>  $ Married          : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
#>  $ Dependents       : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
#>  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
#>  $ Self_Employed    : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
#>  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
#>  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
#>  $ LoanAmount       : num  NA 128 66 120 141 267 95 158 168 349 ...
#>  $ Loan_Amount_Term : num  360 360 360 360 360 360 360 360 360 360 ...
#>  $ Credit_History   : num  1 1 1 1 1 1 1 0 1 1 ...
#>  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
#>  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
#>  $ Total_Income     : num  5849 6091 3000 4941 6000 ...

Change the data type and remove any unnecessary columns:

library(dplyr)
loan_clean <- loan %>% 
  select(-c(X, Loan_ID)) %>% 
  mutate(Credit_History = as.factor(Credit_History))
loan_clean

str(loan_clean)

#> 'data.frame':    500 obs. of  13 variables:
#>  $ Gender           : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#>  $ Married          : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
#>  $ Dependents       : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
#>  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
#>  $ Self_Employed    : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
#>  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
#>  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
#>  $ LoanAmount       : num  NA 128 66 120 141 267 95 158 168 349 ...
#>  $ Loan_Amount_Term : num  360 360 360 360 360 360 360 360 360 360 ...
#>  $ Credit_History   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
#>  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
#>  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
#>  $ Total_Income     : num  5849 6091 3000 4941 6000 ...

Check for missing values:

colSums(is.na(loan_clean))

#>            Gender           Married        Dependents         Education 
#>                 0                 0                 0                 0 
#>     Self_Employed   ApplicantIncome CoapplicantIncome        LoanAmount 
#>                 0                 0                 0                18 
#>  Loan_Amount_Term    Credit_History     Property_Area       Loan_Status 
#>                14                41                 0                 0 
#>      Total_Income 
#>                 0

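Remove any rows that have missing values. Note that is.na() does not count the blank strings ("") that appear as a factor level in Gender, Married, Dependents, and Self_Employed, so rows with blank values in those columns are kept: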

loan_clean <- loan_clean %>% 
  filter(complete.cases(.)) 
colSums(is.na(loan_clean))

#>            Gender           Married        Dependents         Education 
#>                 0                 0                 0                 0 
#>     Self_Employed   ApplicantIncome CoapplicantIncome        LoanAmount 
#>                 0                 0                 0                 0 
#>  Loan_Amount_Term    Credit_History     Property_Area       Loan_Status 
#>                 0                 0                 0                 0 
#>      Total_Income 
#>                 0
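
The str() output earlier shows a blank level "" in Gender, Married, Dependents, and Self_Employed, and the missing-value check above reports 0 for those columns. One possible way to treat blanks as missing is sketched below on a toy data frame; `toy` and `toy_fixed` are hypothetical names, and this step is not part of the analysis in this case study.

```r
# Toy data: one blank Gender value and one true NA in LoanAmount
toy <- data.frame(
  Gender     = factor(c("Male", "", "Female", "Male")),
  LoanAmount = c(128, 66, NA, 120)
)

colSums(is.na(toy))        # Gender 0, LoanAmount 1: the blank is not counted

# Recode "" to NA in every factor column and drop the now-unused level
toy_fixed <- toy
toy_fixed[] <- lapply(toy, function(col) {
  if (is.factor(col)) {
    col[which(col == "")] <- NA
    col <- droplevels(col)
  }
  col
})

colSums(is.na(toy_fixed))  # Gender 1, LoanAmount 1
```

At import time, `read.csv(..., na.strings = c("", "NA"))` is another way to get the same effect.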

Check the class proportions:

prop.table(table(loan_clean$Loan_Status))

#> 
#>         N         Y 
#> 0.3107477 0.6892523

Check the data summary:

summary(loan_clean)

#>     Gender    Married   Dependents        Education   Self_Employed
#>        :  8      :  2     :  9     Graduate    :342      : 21      
#>  Female: 77   No :154   0 :245     Not Graduate: 86   No :352      
#>  Male  :343   Yes:272   1 : 68                        Yes: 55      
#>                         2 : 71                                     
#>                         3+: 35                                     
#>                                                                    
#>  ApplicantIncome CoapplicantIncome   LoanAmount    Loan_Amount_Term
#>  Min.   :  150   Min.   :    0     Min.   : 17.0   Min.   : 36.0   
#>  1st Qu.: 2880   1st Qu.:    0     1st Qu.:100.0   1st Qu.:360.0   
#>  Median : 3863   Median : 1062     Median :127.5   Median :360.0   
#>  Mean   : 5627   Mean   : 1503     Mean   :144.0   Mean   :342.8   
#>  3rd Qu.: 5818   3rd Qu.: 2212     3rd Qu.:162.0   3rd Qu.:360.0   
#>  Max.   :81000   Max.   :20000     Max.   :700.0   Max.   :480.0   
#>  Credit_History   Property_Area Loan_Status  Total_Income  
#>  0: 63          Rural    :123   N:133       Min.   : 1442  
#>  1:365          Semiurban:167   Y:295       1st Qu.: 4166  
#>                 Urban    :138               Median : 5274  
#>                                             Mean   : 7131  
#>                                             3rd Qu.: 7544  
#>                                             Max.   :81000

Logistic Regression

Cross Validation

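The purpose of cross validation is to determine how well our model works on data it has not seen. Here I use a simple hold-out split: 80% of the rows for training and the remaining 20% for testing.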

RNGkind(sample.kind = "Rounding")
set.seed(100)
index_loan <- sample(nrow(loan_clean), nrow(loan_clean)*0.8)
loan_train <- loan_clean[index_loan,]
loan_test <- loan_clean[-index_loan,]

Check the class proportions in the training data:

loan_train$Loan_Status %>% 
  table() %>% 
  prop.table()

#> .
#>         N         Y 
#> 0.3157895 0.6842105
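
The code above makes a single 80/20 hold-out split. For comparison, a k-fold cross-validation loop could look like the sketch below. It runs on simulated data so that it is self-contained; `toy`, `x1`, `x2`, and the choice of 5 folds are all assumptions for illustration, not part of the loan analysis.

```r
set.seed(100)

# Simulated two-class data (hypothetical, stands in for loan_clean)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n) > 0, "Y", "N"))

# Assign every row to one of k folds at random
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# Train on k - 1 folds, test on the held-out fold, record accuracy
fold_acc <- sapply(seq_len(k), function(i) {
  train <- toy[fold_id != i, ]
  test  <- toy[fold_id == i, ]
  fit   <- glm(y ~ x1 + x2, data = train, family = "binomial")
  prob  <- predict(fit, newdata = test, type = "response")
  pred  <- ifelse(prob > 0.5, "Y", "N")
  mean(pred == test$y)
})

fold_acc        # accuracy per fold
mean(fold_acc)  # cross-validated accuracy
```

Averaging over folds gives an estimate that depends less on which rows happen to land in a single test set.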

Build Model

Fit a model on all predictors, then let backward stepwise selection drop the ones that do not improve the AIC:

loan_risk <- glm(formula = Loan_Status ~ . , data = loan_train, family = "binomial")

# using the stepwise method
loan_risk_back <- step(object = loan_risk, direction="backward", trace = 0)
summary(loan_risk_back)

#> 
#> Call:
#> glm(formula = Loan_Status ~ Loan_Amount_Term + Credit_History + 
#>     Property_Area, family = "binomial", data = loan_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.7052  -0.3290   0.5103   0.7926   2.4264  
#> 
#> Coefficients:
#>                         Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)            -1.336430   1.042366  -1.282 0.199803    
#> Loan_Amount_Term       -0.005124   0.002707  -1.892 0.058435 .  
#> Credit_History1         3.886297   0.561857   6.917 4.62e-12 ***
#> Property_AreaSemiurban  1.267555   0.359440   3.526 0.000421 ***
#> Property_AreaUrban      0.291414   0.333679   0.873 0.382480    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 426.58  on 341  degrees of freedom
#> Residual deviance: 314.98  on 337  degrees of freedom
#> AIC: 324.98
#> 
#> Number of Fisher Scoring iterations: 5
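
To make the coefficients easier to read, they can be converted from log-odds to odds ratios, and a predicted probability can be worked out by hand. The sketch below copies the estimates from the summary output above; the example applicant (360-month term, credit history 1, semiurban property) is hypothetical.

```r
# Estimates copied from summary(loan_risk_back)
coefs <- c(
  "(Intercept)"            = -1.336430,
  "Loan_Amount_Term"       = -0.005124,
  "Credit_History1"        =  3.886297,
  "Property_AreaSemiurban" =  1.267555,
  "Property_AreaUrban"     =  0.291414
)

# exp(coefficient) is the multiplicative change in the odds of approval
odds_ratios <- exp(coefs[-1])
round(odds_ratios, 2)

# Hypothetical applicant: term = 360, Credit_History = 1, Semiurban
log_odds <- coefs[["(Intercept)"]] +
  coefs[["Loan_Amount_Term"]] * 360 +
  coefs[["Credit_History1"]] +
  coefs[["Property_AreaSemiurban"]]

# The inverse logit turns log-odds into a probability of "Y"
p_approved <- plogis(log_odds)
p_approved
```

Holding the other predictors fixed, the odds of approval for an applicant with credit history 1 are close to 49 times those of an applicant with credit history 0, which matches Credit_History being the most significant term in the summary.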

Predict

Predict on the test data using the fitted model, with predict(object, newdata, type):

library(class)
loan_pred <- predict(object = loan_risk_back, newdata = loan_test, type = "response")

loan_pred_label <- as.factor(ifelse(loan_pred > 0.5, "Y", "N"))

table(loan_pred_label)

#> loan_pred_label
#>  N  Y 
#> 14 72

table(predict = loan_pred_label, 
      actual = loan_test$Loan_Status)

#>        actual
#> predict  N  Y
#>       N 12  2
#>       Y 13 59

Model Evaluation

library(caret)
CM_LR <- confusionMatrix(data = loan_pred_label, reference = loan_test$Loan_Status, positive = "Y")
CM_LR

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  N  Y
#>          N 12  2
#>          Y 13 59
#>                                          
#>                Accuracy : 0.8256         
#>                  95% CI : (0.7287, 0.899)
#>     No Information Rate : 0.7093         
#>     P-Value [Acc > NIR] : 0.009537       
#>                                          
#>                   Kappa : 0.5139         
#>                                          
#>  Mcnemar's Test P-Value : 0.009823       
#>                                          
#>             Sensitivity : 0.9672         
#>             Specificity : 0.4800         
#>          Pos Pred Value : 0.8194         
#>          Neg Pred Value : 0.8571         
#>              Prevalence : 0.7093         
#>          Detection Rate : 0.6860         
#>    Detection Prevalence : 0.8372         
#>       Balanced Accuracy : 0.7236         
#>                                          
#>        'Positive' Class : Y              
#>

Accuracy: the share of all test cases the model classifies correctly.
Sensitivity/Recall: the share of actual positive cases (here, approved loans) the model identifies correctly.
Specificity: the share of actual negative cases (rejected loans) the model identifies correctly.
Pos Pred Value/Precision: the share of predicted positives that are actually positive.

Based on the confusion matrix, the model’s overall accuracy in predicting Loan Status is 82.6%. Of the applicants whose loans were actually not approved, the model identified 48% correctly (specificity). Of the applicants whose loans were actually approved, it identified 96.7% correctly (sensitivity). Of all the applicants the model predicted as approved, 81.9% actually were (precision).
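
The four metrics can be reproduced by hand from the counts in the confusion matrix, which makes the definitions concrete. This is a sketch in base R; the object names (`cm`, `tp`, `tn`, `fp`, `fn`) are my own.

```r
# Counts from the logistic regression confusion matrix (rows = predicted)
cm <- matrix(c(12, 13, 2, 59), nrow = 2,
             dimnames = list(predict = c("N", "Y"), actual = c("N", "Y")))

tp <- cm["Y", "Y"]  # predicted Y, actually Y
tn <- cm["N", "N"]  # predicted N, actually N
fp <- cm["Y", "N"]  # predicted Y, actually N
fn <- cm["N", "Y"]  # predicted N, actually Y

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  precision   = tp / (tp + fp)
)
round(metrics, 4)
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, and Pos Pred Value lines printed by confusionMatrix().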

K-Nearest Neighbor (K-NN)

Cross Validation

set.seed(100)
index_loan <- sample(nrow(loan_clean), nrow(loan_clean)*0.8)
loan_train <- loan_clean[index_loan,]
loan_test <- loan_clean[-index_loan,]

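Before we can scale, we must first separate predictors and targets. Only the numeric columns are kept as predictors, so factor columns such as Credit_History are not used by K-NN: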

# predictors, training data
train_x <- loan_train %>% 
            select_if(is.numeric) 

# target, training data
train_y <- loan_train %>% 
           select(Loan_Status) 

# predictors, test data
test_x <- loan_test %>% 
  select_if(is.numeric)

# target, test data
test_y <- loan_test %>% 
  select(Loan_Status)

Scaling for train and test data:

train_x_scaled <- scale(train_x)
test_x_scaled <- scale(test_x,
                center = attr(train_x_scaled,"scaled:center"), 
                scale = attr(train_x_scaled, "scaled:scale"))
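
The test data is scaled with the mean and standard deviation of the training data, so that no information from the test set leaks into preprocessing. The sketch below checks, on a toy matrix, that this is the same as subtracting the training mean and dividing by the training standard deviation; `train_toy` and `test_toy` are hypothetical.

```r
# Toy training and test predictors (two columns)
train_toy <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2,
                    dimnames = list(NULL, c("a", "b")))
test_toy  <- matrix(c(5, 50), ncol = 2,
                    dimnames = list(NULL, c("a", "b")))

# scale() stores the column means and standard deviations as attributes
train_scaled <- scale(train_toy)
mu    <- attr(train_scaled, "scaled:center")
sigma <- attr(train_scaled, "scaled:scale")

# Apply the training parameters to the test data
test_scaled <- scale(test_toy, center = mu, scale = sigma)

# Same computation written out by hand
manual <- sweep(sweep(test_toy, 2, mu, "-"), 2, sigma, "/")
all.equal(as.numeric(test_scaled), as.numeric(manual))
```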

Choose k as the square root of the number of training observations:

train_x %>% 
  nrow() %>% 
  sqrt()

#> [1] 18.49324

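Rounding down gives k = 18. With two classes, an even k can produce tied votes, which knn() breaks at random; an odd value such as 17 or 19 would avoid that.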
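
The square-root rule is only a starting point. One way to compare several candidate values is leave-one-out cross validation with knn.cv() from the class package, sketched below. It uses the built-in iris data (three classes, not the loan data) purely so the example is self-contained, and the candidate values are arbitrary choices near the square root of 150.

```r
library(class)
set.seed(100)

# Built-in data as a stand-in: scaled numeric predictors and a factor target
x <- scale(iris[, 1:4])
y <- iris$Species

# Odd candidates around sqrt(nrow(iris))
candidate_k <- c(5, 9, 11, 13, 15)

# Leave-one-out accuracy for each candidate k
loo_acc <- sapply(candidate_k, function(k) {
  pred <- knn.cv(train = x, cl = y, k = k)
  mean(pred == y)
})
names(loo_acc) <- candidate_k
loo_acc
```

The k with the highest leave-one-out accuracy on the training data would then be passed to knn().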

Use the knn() function from the class package to fit the model:

train : predictors from the training data
test : predictors from the test data
cl : target labels from the training data
k : the number of neighbors to use

library(class)
loan_pred <-  knn(train = train_x_scaled,
                  test = test_x_scaled, 
                  cl = train_y$Loan_Status,
                  k = 18)
loan_pred

#>  [1] Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
#> [39] Y Y Y Y Y Y Y Y Y N Y Y Y Y Y Y Y Y Y Y Y Y N Y Y Y Y Y Y Y Y Y Y Y Y N Y N
#> [77] Y Y Y Y Y Y Y Y Y Y
#> Levels: N Y

Model Evaluation

library(caret) 
CM_KNN <- confusionMatrix(data = loan_pred, reference = test_y$Loan_Status, positive = "Y")
CM_KNN

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  N  Y
#>          N  4  0
#>          Y 21 61
#>                                          
#>                Accuracy : 0.7558         
#>                  95% CI : (0.6513, 0.842)
#>     No Information Rate : 0.7093         
#>     P-Value [Acc > NIR] : 0.2044         
#>                                          
#>                   Kappa : 0.2127         
#>                                          
#>  Mcnemar's Test P-Value : 1.275e-05      
#>                                          
#>             Sensitivity : 1.0000         
#>             Specificity : 0.1600         
#>          Pos Pred Value : 0.7439         
#>          Neg Pred Value : 1.0000         
#>              Prevalence : 0.7093         
#>          Detection Rate : 0.7093         
#>    Detection Prevalence : 0.9535         
#>       Balanced Accuracy : 0.5800         
#>                                          
#>        'Positive' Class : Y              
#>

Accuracy: the share of all test cases the model classifies correctly.
Sensitivity/Recall: the share of actual positive cases (here, approved loans) the model identifies correctly.
Specificity: the share of actual negative cases (rejected loans) the model identifies correctly.
Pos Pred Value/Precision: the share of predicted positives that are actually positive.

Based on the confusion matrix, the model’s overall accuracy in predicting Loan Status is 75.6%. Of the applicants whose loans were actually not approved, the model identified only 16% correctly (specificity). Of the applicants whose loans were actually approved, it identified 100% correctly (sensitivity). Of all the applicants the model predicted as approved, 74.4% actually were (precision).

Conclusion

Which metric should I prioritize when predicting whether a prospective customer’s loan will be approved or denied? I’ll use recall. The concern behind this choice is that a model which predicts approval for customers who should be denied harms the company and adds risk. Note that this particular error is a false positive, which precision and specificity measure more directly than recall.

Comparing the two models, Logistic Regression and K-Nearest Neighbor, K-Nearest Neighbor is better at correctly identifying customers whose loans were actually approved: its recall is 100%, compared with 96.7% for logistic regression. On the recall metric, then, K-Nearest Neighbor is the model to use. Its specificity is much lower (16% versus 48%), so before relying on it to reduce lending risk, the company should weigh the cost of the rejected applicants it would approve.
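
To see the two models side by side, the counts from both confusion matrices can be turned into a small comparison table. This is a sketch; the helper `metrics_from_counts` is a hypothetical name of my own and not part of caret.

```r
# Hypothetical helper: metrics from confusion-matrix counts (positive = "Y")
metrics_from_counts <- function(tp, fp, fn, tn) {
  c(recall      = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp),
    accuracy    = (tp + tn) / (tp + fp + fn + tn))
}

# Counts taken from the two confusion matrices reported above
comparison <- rbind(
  logistic_regression = metrics_from_counts(tp = 59, fp = 13, fn = 2, tn = 12),
  knn                 = metrics_from_counts(tp = 61, fp = 21, fn = 0, tn = 4)
)
round(comparison, 3)
```

Recall favors K-NN, while specificity, precision, and accuracy favor logistic regression on this test set.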