Project Description

A Bank Manager is tasked by the Company with classifying whether the Bank’s customers are eligible for a loan when they first fill in the online application form. The goal is to identify the potential customers whom the Bank can target for further follow-up on their loan applications.

To do that, we will help the Bank Manager by building classification models using logistic regression and K-Nearest Neighbour (K-NN). The dataset we use is derived from: https://www.kaggle.com/datasets/vipin20/loan-application-data.

# load all libraries for further usage

library(data.table)
library(dplyr)
library(class)
library(caret)
library(stringr)
library(ggplot2)
library(tidyr)
library(gtools)
library(fastDummies)

Reading and Cleaning the Dataset

loan_class <- read.csv("datainput/df1_loan.csv", stringsAsFactors = T)

glimpse(loan_class)
#> Rows: 500
#> Columns: 15
#> $ X                 <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15~
#> $ Loan_ID           <fct> LP001002, LP001003, LP001005, LP001006, LP001008, LP~
#> $ Gender            <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male~
#> $ Married           <fct> No, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes,~
#> $ Dependents        <fct> 0, 1, 0, 0, 0, 2, 0, 3+, 2, 1, 2, 2, 2, 0, 2, 0, 1, ~
#> $ Education         <fct> Graduate, Graduate, Graduate, Not Graduate, Graduate~
#> $ Self_Employed     <fct> No, No, Yes, No, No, Yes, No, No, No, No, No, , No, ~
#> $ ApplicantIncome   <int> 5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006~
#> $ CoapplicantIncome <dbl> 0, 1508, 0, 2358, 0, 4196, 1516, 2504, 1526, 10968, ~
#> $ LoanAmount        <dbl> NA, 128, 66, 120, 141, 267, 95, 158, 168, 349, 70, 1~
#> $ Loan_Amount_Term  <dbl> 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 36~
#> $ Credit_History    <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, NA, ~
#> $ Property_Area     <fct> Urban, Rural, Urban, Urban, Urban, Urban, Urban, Sem~
#> $ Loan_Status       <fct> Y, N, Y, Y, Y, Y, Y, N, Y, N, Y, Y, Y, N, Y, Y, Y, N~
#> $ Total_Income      <fct> $5849.0, $6091.0, $3000.0, $4941.0, $6000.0, $9613.0~

It seems that:

  • Total_Income is not read as numeric because its values contain a dollar sign ($). Let’s fix this by deleting the column and recreating it as the sum of ApplicantIncome and CoapplicantIncome, so that it holds numerical values.
  • Credit_History is still treated as numeric. Let’s convert it to a factor.
  • Let’s also remove unnecessary columns such as X and Loan_ID. We will delete ApplicantIncome and CoapplicantIncome as well, to avoid redundancy with Total_Income, which serves as a proxy for income.

# convert Credit_History from numeric to factor
loan_class$Credit_History <- as.factor(loan_class$Credit_History)
# drop X, Loan_ID, and the dollar-formatted Total_Income
loan_class1 <- loan_class[-c(1:2,15)]
# rebuild Total_Income as a numeric sum of the two income columns
loan_class1$Total_Income <- rowSums(loan_class1[,c("ApplicantIncome", "CoapplicantIncome")])
# drop ApplicantIncome and CoapplicantIncome (now redundant)
loan_class2 <- loan_class1[-c(6:7)]

str(loan_class2)
#> 'data.frame':    500 obs. of  11 variables:
#>  $ Gender          : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#>  $ Married         : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
#>  $ Dependents      : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
#>  $ Education       : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
#>  $ Self_Employed   : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
#>  $ LoanAmount      : num  NA 128 66 120 141 267 95 158 168 349 ...
#>  $ Loan_Amount_Term: num  360 360 360 360 360 360 360 360 360 360 ...
#>  $ Credit_History  : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
#>  $ Property_Area   : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
#>  $ Loan_Status     : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
#>  $ Total_Income    : num  5849 6091 3000 4941 6000 ...

Based on the information we have about the dataset, the variable descriptions are as follows:

  • Gender: gender of the applicant (Male or Female)
  • Married: marital status (Yes or No)
  • Dependents: number of dependents of the applicant (0, 1, 2, or 3+)
  • Education: education level (Graduate or Not Graduate)
  • Self_Employed: whether the applicant is self-employed (Yes or No)
  • LoanAmount: requested loan amount
  • Loan_Amount_Term: loan term (the length of time it will take for the loan to be completely paid off when the borrower makes regular payments)
  • Credit_History: credit history (has a credit history = 1, no credit history = 0)
  • Property_Area: area of the property (Rural, Semiurban, or Urban)
  • Loan_Status: loan approval status (Y = loan approved, N = loan not approved)
  • Total_Income: total household income of the applicant

Next, we need to check whether our dataset has any missing values.

anyNA(loan_class2)
#> [1] TRUE
colSums(is.na(loan_class2))
#>           Gender          Married       Dependents        Education 
#>                0                0                0                0 
#>    Self_Employed       LoanAmount Loan_Amount_Term   Credit_History 
#>                0               18               14               41 
#>    Property_Area      Loan_Status     Total_Income 
#>                0                0                0

Since the data has several missing values, let’s exclude the rows that contain them.

loan_class_clean <- na.omit(loan_class2)
anyNA(loan_class_clean)
#> [1] FALSE
colSums(is.na(loan_class_clean))
#>           Gender          Married       Dependents        Education 
#>                0                0                0                0 
#>    Self_Employed       LoanAmount Loan_Amount_Term   Credit_History 
#>                0                0                0                0 
#>    Property_Area      Loan_Status     Total_Income 
#>                0                0                0
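Note that na.omit() only drops explicit NA values; the blank factor levels ("") that were visible in glimpse() earlier are kept. Purely for reference, here is a sketch (not applied in this analysis) of how those blanks could be recoded as missing values too, using the dplyr helpers na_if() and across():

# a sketch (not applied here): recode blank factor levels ("") as NA,
# so that na.omit() would drop those rows as well
loan_class2_alt <- loan_class2 %>%
  mutate(across(where(is.factor), ~ factor(na_if(as.character(.x), ""))))
colSums(is.na(loan_class2_alt))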

Now let’s inspect the data for outliers, to make sure that our subsequent analysis is not biased by them.

summary(loan_class_clean)
#>     Gender    Married   Dependents        Education   Self_Employed
#>        :  8      :  2     :  9     Graduate    :342      : 21      
#>  Female: 77   No :154   0 :245     Not Graduate: 86   No :352      
#>  Male  :343   Yes:272   1 : 68                        Yes: 55      
#>                         2 : 71                                     
#>                         3+: 35                                     
#>                                                                    
#>    LoanAmount    Loan_Amount_Term Credit_History   Property_Area Loan_Status
#>  Min.   : 17.0   Min.   : 36.0    0: 63          Rural    :123   N:133      
#>  1st Qu.:100.0   1st Qu.:360.0    1:365          Semiurban:167   Y:295      
#>  Median :127.5   Median :360.0                   Urban    :138              
#>  Mean   :144.0   Mean   :342.8                                              
#>  3rd Qu.:162.0   3rd Qu.:360.0                                              
#>  Max.   :700.0   Max.   :480.0                                              
#>   Total_Income  
#>  Min.   : 1442  
#>  1st Qu.: 4166  
#>  Median : 5274  
#>  Mean   : 7131  
#>  3rd Qu.: 7544  
#>  Max.   :81000
boxplot(loan_class_clean)

It can be seen from the boxplot above that Total_Income has by far the most outliers.

Let’s confirm this with a dedicated boxplot and a histogram.

boxplot(loan_class_clean$Total_Income)

hist(loan_class_clean$Total_Income, breaks = 50)

To make sure that our analysis is not biased by the outliers, let’s remove them.

# filter: keep only rows with Total_Income below 8000
loan_class_nooutlier <- loan_class_clean[loan_class_clean$Total_Income < 8000,]

summary(loan_class_nooutlier)
#>     Gender    Married   Dependents        Education   Self_Employed
#>        :  4      :  2     :  8     Graduate    :247      : 19      
#>  Female: 67   No :121   0 :197     Not Graduate: 84   No :276      
#>  Male  :260   Yes:208   1 : 49                        Yes: 36      
#>                         2 : 54                                     
#>                         3+: 23                                     
#>                                                                    
#>    LoanAmount    Loan_Amount_Term Credit_History   Property_Area Loan_Status
#>  Min.   : 17.0   Min.   : 36.0    0: 50          Rural    : 96   N:102      
#>  1st Qu.: 96.0   1st Qu.:360.0    1:281          Semiurban:127   Y:229      
#>  Median :118.0   Median :360.0                   Urban    :108              
#>  Mean   :117.7   Mean   :345.3                                              
#>  3rd Qu.:137.5   3rd Qu.:360.0                                              
#>  Max.   :275.0   Max.   :480.0                                              
#>   Total_Income 
#>  Min.   :1442  
#>  1st Qu.:3791  
#>  Median :4727  
#>  Mean   :4819  
#>  3rd Qu.:5733  
#>  Max.   :7978

Just to be safe, let’s check again whether we have really removed the outliers from our dataset.

boxplot(loan_class_nooutlier$Total_Income)

hist(loan_class_nooutlier$Total_Income, breaks = 50)

Based on the boxplot and histogram above, it’s safe to say that the loan_class_nooutlier dataset is reasonably free of outliers.
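As a side note, the 8000 cutoff was picked by eye from the plots above. A more systematic alternative (just a sketch, not used in this analysis) would derive an upper fence from the 1.5 × IQR rule:

# a sketch (not applied here): derive an upper fence from the 1.5*IQR rule
upper_fence <- quantile(loan_class_clean$Total_Income, 0.75) +
  1.5 * IQR(loan_class_clean$Total_Income)
upper_fence # compare with the fixed 8000 cutoff used above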

GREAT! Let’s proceed with pre-processing our data.

Pre-Processing Data

Before we start the modelling, we need to check the class proportions of the target variable, Loan_Status.

# Check proportion of target variable
prop.table(table(loan_class_nooutlier$Loan_Status)) %>% round(2)
#> 
#>    N    Y 
#> 0.31 0.69
table(loan_class_nooutlier$Loan_Status)
#> 
#>   N   Y 
#> 102 229

Since the class ratio is close to 1:2, we can consider the data reasonably balanced. Therefore, we can proceed to the next step.

Splitting

The next step is to split the data into train and test sets. The train data will be used for modelling, while the test data will be used to evaluate how the model handles unseen data. The data will be split 80/20 (80% for the train data, and 20% for the test data).

RNGkind(sample.kind = "Rounding") # match the sampling algorithm of older R versions
set.seed(123)

index <- sample(x = nrow(loan_class_nooutlier), 
                size = nrow(loan_class_nooutlier)*0.8)

#splitting
loan_train <- loan_class_nooutlier[index,]
loan_test <-  loan_class_nooutlier[-index,]

Just to be on the safe side, let’s confirm whether loan_train is indeed about 80% of loan_class_nooutlier.

nrow(loan_train)
#> [1] 264
nrow(loan_class_nooutlier)*0.8
#> [1] 264.8

Yep, close enough: sample() keeps only the integer part of 264.8, hence the 264 rows. Now, let’s check the proportion of the target variable in the train data.

# recheck class balance
prop.table(table(loan_train$Loan_Status))
#> 
#>         N         Y 
#> 0.3068182 0.6931818

The class balance is similar to that of the loan_class_nooutlier dataset. That means it’s good to go~!

Logistic Regression

Build Model

The logistic regression model is built using the glm() function. However, since we want to make sure that we develop the best-fitting model, we will also perform feature selection with stepwise regression, searching in both directions (backward and forward).

To run a both-direction stepwise search, we first need to set the lower and upper bounds of the search space by creating model_none and model_all.

# intercept-only model: the lower bound of the stepwise search
model_none <- glm(formula = Loan_Status ~ 1, data = loan_train, family = binomial)
# full model with all predictors: the upper bound of the search
model_all <- glm(formula = Loan_Status ~ ., data = loan_train, family = binomial)

summary(model_all)
#> 
#> Call:
#> glm(formula = Loan_Status ~ ., family = binomial, data = loan_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.2305  -0.3414   0.3899   0.6598   2.3865  
#> 
#> Coefficients:
#>                          Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)             1.221e+01  1.018e+03   0.012  0.99044    
#> GenderFemale            8.635e-01  1.466e+00   0.589  0.55579    
#> GenderMale              1.219e+00  1.405e+00   0.868  0.38556    
#> MarriedNo              -1.319e+01  1.018e+03  -0.013  0.98967    
#> MarriedYes             -1.313e+01  1.018e+03  -0.013  0.98971    
#> Dependents0            -1.097e+00  1.743e+00  -0.630  0.52898    
#> Dependents1            -1.908e+00  1.791e+00  -1.065  0.28680    
#> Dependents2            -1.019e+00  1.788e+00  -0.570  0.56863    
#> Dependents3+           -1.046e+00  1.831e+00  -0.571  0.56770    
#> EducationNot Graduate  -5.373e-01  3.894e-01  -1.380  0.16761    
#> Self_EmployedNo        -2.984e-02  7.558e-01  -0.039  0.96851    
#> Self_EmployedYes       -2.702e-01  9.074e-01  -0.298  0.76587    
#> LoanAmount             -5.070e-03  5.537e-03  -0.916  0.35983    
#> Loan_Amount_Term       -6.502e-03  3.798e-03  -1.712  0.08688 .  
#> Credit_History1         4.045e+00  6.107e-01   6.624 3.49e-11 ***
#> Property_AreaSemiurban  1.326e+00  4.579e-01   2.897  0.00377 ** 
#> Property_AreaUrban     -3.592e-02  4.433e-01  -0.081  0.93542    
#> Total_Income            2.243e-04  1.632e-04   1.375  0.16920    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 325.53  on 263  degrees of freedom
#> Residual deviance: 216.97  on 246  degrees of freedom
#> AIC: 252.97
#> 
#> Number of Fisher Scoring iterations: 14

Now, let’s do the feature selection

model_step <- step(object = model_none,
                   direction = "both",
                   scope = list(upper = model_all),
                   trace = FALSE)
summary(model_step)
#> 
#> Call:
#> glm(formula = Loan_Status ~ Credit_History + Property_Area + 
#>     Loan_Amount_Term, family = binomial, data = loan_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.2974  -0.3309   0.4516   0.7913   2.4455  
#> 
#> Coefficients:
#>                         Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)            -0.940051   1.272418  -0.739  0.46003    
#> Credit_History1         3.878285   0.580768   6.678 2.42e-11 ***
#> Property_AreaSemiurban  1.292314   0.427928   3.020  0.00253 ** 
#> Property_AreaUrban      0.061087   0.394078   0.155  0.87681    
#> Loan_Amount_Term       -0.005552   0.003510  -1.582  0.11375    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 325.53  on 263  degrees of freedom
#> Residual deviance: 227.52  on 259  degrees of freedom
#> AIC: 237.52
#> 
#> Number of Fisher Scoring iterations: 5

Using step-wise regression, we found that the selected model is: > glm(formula = Loan_Status ~ Credit_History + Property_Area + Loan_Amount_Term, family = binomial, data = loan_train)
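We can also read the selected formula directly from the fitted object, which avoids transcription mistakes:

# confirm which predictors the step-wise search kept
formula(model_step)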

However, based on our understanding of the banking industry, Total_Income and LoanAmount may well affect loan approval, because they indicate whether the customer will be able to repay the loan in the future. So we will create another model, called model_optimum, that adds Total_Income and LoanAmount to model_step.

model_optimum <- glm(formula = Loan_Status ~ Credit_History + Property_Area + Loan_Amount_Term + LoanAmount + Total_Income, family = binomial, data = loan_train)

summary(model_optimum)
#> 
#> Call:
#> glm(formula = Loan_Status ~ Credit_History + Property_Area + 
#>     Loan_Amount_Term + LoanAmount + Total_Income, family = binomial, 
#>     data = loan_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.2945  -0.3506   0.4517   0.7518   2.4694  
#> 
#> Coefficients:
#>                          Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)            -1.6853291  1.4628950  -1.152    0.249    
#> Credit_History1         3.9014176  0.5808989   6.716 1.87e-11 ***
#> Property_AreaSemiurban  1.3438571  0.4349054   3.090    0.002 ** 
#> Property_AreaUrban      0.0783288  0.4055971   0.193    0.847    
#> Loan_Amount_Term       -0.0049702  0.0035436  -1.403    0.161    
#> LoanAmount             -0.0047831  0.0052063  -0.919    0.358    
#> Total_Income            0.0002229  0.0001500   1.486    0.137    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 325.53  on 263  degrees of freedom
#> Residual deviance: 225.25  on 257  degrees of freedom
#> AIC: 239.25
#> 
#> Number of Fisher Scoring iterations: 5

Predict

Using model_optimum above, we will now predict on the test data that we set aside earlier.

# predicted probability of loan approval for each test row
loan_test$prob_approve <- predict(model_optimum, type = "response", newdata = loan_test)
head(loan_test)

Let’s see the distribution of the predicted probabilities!

unique(loan_test$Loan_Status)
#> [1] Y N
#> Levels: N Y

Base/negative class: N = 0; positive class: Y = 1. (With a factor response, glm() models the probability of the second factor level, here Y.)

ggplot(loan_test, aes(x=prob_approve)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of the Probability of Loan Eligibility Prediction") +
  theme_minimal()

The plot above tells us that the predicted probabilities are skewed towards 1, which means “approved”.

Now let’s convert the predicted probabilities into class labels by utilising the ifelse() function. The threshold that we will use is 0.5, which means:

  • if the predicted probability > 0.5 -> Y (approved)
  • if the predicted probability <= 0.5 -> N (not approved)

# label the predictions using the 0.5 threshold
loan_test$pred_label <- ifelse(test = loan_test$prob_approve > 0.5,
                                yes = "Y",
                                no = "N")

head(loan_test)

Model Evaluation

Let’s evaluate the model using a confusion matrix. But first, let’s make sure that pred_label is converted to a factor.

loan_test$pred_label <- as.factor(loan_test$pred_label)
str(loan_test$pred_label)
#>  Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 1 2 ...
# confusion matrix

loan_model_evaluation <- confusionMatrix(data = loan_test$pred_label,
                reference = loan_test$Loan_Status,
                positive = "Y")

loan_model_evaluation
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  N  Y
#>          N  7  0
#>          Y 14 46
#>                                           
#>                Accuracy : 0.791           
#>                  95% CI : (0.6743, 0.8808)
#>     No Information Rate : 0.6866          
#>     P-Value [Acc > NIR] : 0.039911        
#>                                           
#>                   Kappa : 0.4071          
#>                                           
#>  Mcnemar's Test P-Value : 0.000512        
#>                                           
#>             Sensitivity : 1.0000          
#>             Specificity : 0.3333          
#>          Pos Pred Value : 0.7667          
#>          Neg Pred Value : 1.0000          
#>              Prevalence : 0.6866          
#>          Detection Rate : 0.6866          
#>    Detection Prevalence : 0.8955          
#>       Balanced Accuracy : 0.6667          
#>                                           
#>        'Positive' Class : Y               
#> 

Based on the confusion matrix above:

  • True Positives (TP): 46
  • False Positives (FP): 14
  • True Negatives (TN): 7
  • False Negatives (FN): 0

Let’s recall!

  • Sensitivity / Recall: of all the observations that are actually positive, the proportion our model predicts correctly. In our case, the sensitivity is 100%.
  • Specificity: of all the observations that are actually negative, the proportion our model predicts correctly. In our case, the specificity is 33.33%.
  • Accuracy: the proportion of all observations that our model predicts correctly. Our accuracy is 79.1%.
  • Precision / Pos Pred Value: of all the positive predictions, the proportion that are actually positive. Our precision is 76.67%.

Let’s double check whether the calculation above is already correct or not.

# Recall = TP/(TP+FN)
Recall <- round(46 / (46 + 0),2)
# Specificity = TN/(TN+FP)
Specificity <- round(7 / (7 + 14), 2)
# Accuracy = (TP + TN) / TOTAL
Accuracy <- round((46 + 7) / nrow(loan_test), 2)
# Precision = TP/(TP+FP)
Precision <- round(46 / (46 + 14), 2)

performance <- cbind.data.frame(Recall, Specificity, Accuracy, Precision)

performance

The numbers match, which means our confusion matrix reading above is valid.

From the result, we can conclude that our model correctly predicts the target Loan_Status (loan approved or not approved) for around 79% of the test observations.

Of all the loans that were actually not approved, the model correctly identifies only 33%.

Meanwhile, of all the loans that were actually approved, the model correctly identifies 100%.

Of all the approvals the model predicts, around 77% are actually approved.

This means our model performs extremely well on recall/sensitivity, reasonably well on accuracy and precision, but worst on specificity, i.e. predicting the non-approved loans.

K-Nearest Neighbour (K-NN)

Pre-Processing Data

Since K-NN requires numerical predictors, we first need to transform all of our categorical variables (apart from our target variable, Loan_Status) into numbers by creating dummy (boolean) columns. We also need to drop the variables that are not used in model_optimum, so that both models work with the same predictors.

# keep only the columns used in model_optimum (drop the first five factor columns)
loan_KNN_nooutlier <- loan_class_nooutlier[-c(1:5)]
categorical <- c('Credit_History', 'Property_Area')
results <- fastDummies::dummy_cols(loan_KNN_nooutlier, select_columns = categorical) # create dummy columns
res <- results[,!(names(results) %in% categorical)] # drop the original categorical columns
res

Next, we need to split the dummy-encoded data into train and test sets, using the same seed and ratio as before.

RNGkind(sample.kind = "Rounding")
set.seed(123)
# reproduce the same 80/20 split on the dummy-encoded data

indexKNN <- sample(x = nrow(res),
                size = nrow(res)*0.8)

loanKNN_train <- res[indexKNN,]
loanKNN_test <-  res[-indexKNN,]

Separate the predictors from the target!

# predictor for `train`
train_x <- loanKNN_train %>% select(-Loan_Status)

# predictor for  `test`
test_x <- loanKNN_test %>% select(-Loan_Status)

# target for `train`
train_y <- loanKNN_train %>% pull(Loan_Status)

# target for `test`
test_y <- loanKNN_test %>% pull(Loan_Status)

Now, let’s scale the data. The train predictors will be scaled using z-score standardization. The test predictors will then be scaled using the center and scale parameters taken from the train data, because the test data must be treated as unseen.

train_x_scale <- train_x %>%
  scale()

test_x_scale <- test_x %>%
  scale(center = attr(train_x_scale,"scaled:center"),
        scale = attr(train_x_scale,"scaled:scale"))

test_x_scale
#>       LoanAmount Loan_Amount_Term Total_Income Credit_History_0
#> 3    0.073610534        0.2504782   0.08043537       -0.4402648
#> 4    0.614136098        0.2504782   0.85443337       -0.4402648
#> 5   -0.569872281        0.2504782  -0.71768154       -0.4402648
#> 7    1.309097538        0.2504782   0.51238326       -0.4402648
#> 9   -0.209521904        0.2504782  -0.35882128       -0.4402648
#> 10  -0.080825341        0.2504782  -0.10082195       -0.4402648
#> 11  -2.577538662       -3.6973571  -1.78768444       -0.4402648
#> 12   0.202307097        0.2504782   0.08701326       -0.4402648
#> 17  -0.132303967        0.2504782   0.32966419        2.2627565
#> 23  -0.286739842        0.2504782  -0.10155282       -0.4402648
#> 24  -0.080825341        0.2504782   0.24561342       -0.4402648
#> 25   0.691354036        0.2504782   0.21564749       -0.4402648
#> 27  -0.955961969        0.2504782  -0.89966974       -0.4402648
#> 48   0.485439535        0.2504782  -0.38805633        2.2627565
#> 49   1.412054788        0.2504782   1.05469347       -0.4402648
#> 53  -0.106564654        2.2243959  -0.79003830       -0.4402648
#> 73   2.544584542        0.2504782   1.87766018        2.2627565
#> 87  -0.853004719        0.2504782  -0.18121834       -0.4402648
#> 91   1.309097538        0.2504782   0.42833249       -0.4402648
#> 95  -2.242927598        0.2504782  -1.36523794       -0.4402648
#> 98   0.897268537        0.2504782   1.09342992       -0.4402648
#> 99  -0.106564654        0.2504782  -1.86369557       -0.4402648
#> 100 -1.728141347        0.2504782  -1.15255294       -0.4402648
#> 108 -1.393530283        2.2243959  -1.89585413        2.2627565
#> 113 -0.415436405        0.2504782  -0.85289366       -0.4402648
#> 122 -1.058919220        0.2504782  -0.71110366       -0.4402648
#> 132 -1.496487534        0.2504782  -1.71898207       -0.4402648
#> 136  0.331003660        0.2504782   0.63736311       -0.4402648
#> 147  0.536918161        0.2504782  -0.07012514       -0.4402648
#> 151 -0.338218467        0.2504782  -0.20899164       -0.4402648
#> 155 -0.132303967        0.2504782  -0.74472396       -0.4402648
#> 162 -0.698568843        0.2504782  -0.84339226       -0.4402648
#> 163  2.158494853        0.2504782   0.53138605        2.2627565
#> 169  0.433960910        0.2504782  -0.41071350       -0.4402648
#> 170  0.974486475       -5.0790995  -0.98664402       -0.4402648
#> 173  0.871529224        0.2504782   1.98144462       -0.4402648
#> 176  0.459700223        0.2504782  -1.03268923       -0.4402648
#> 177 -0.698568843        0.2504782  -0.28865716       -0.4402648
#> 189 -0.724308156        0.2504782   1.00864827       -0.4402648
#> 192 -0.055086029        0.2504782  -0.51669056       -0.4402648
#> 194  0.253785722        0.2504782  -0.31496870        2.2627565
#> 200  0.459700223        0.2504782  -0.12055561       -0.4402648
#> 201  0.871529224        2.2243959   1.98071374       -0.4402648
#> 211 -0.158043279        0.2504782   0.36620800       -0.4402648
#> 219  1.720926539        0.2504782   2.30010668       -0.4402648
#> 220 -0.183782592        0.2504782   1.42744039       -0.4402648
#> 222 -0.003607404        0.2504782   0.98379847       -0.4402648
#> 225  1.103183037        0.2504782   0.80911904       -0.4402648
#> 229 -1.831098597        0.2504782  -1.79280057       -0.4402648
#> 233  1.103183037        0.2504782   0.09212939       -0.4402648
#> 248  0.974486475        0.2504782   0.25876919       -0.4402648
#> 249 -0.158043279       -2.7103983   0.15937001        2.2627565
#> 254  0.279525035        0.2504782   0.73237703       -0.4402648
#> 256  0.279525035        0.2504782   1.70736601       -0.4402648
#> 258 -0.106564654       -2.7103983  -0.85070103       -0.4402648
#> 260  0.485439535        0.2504782  -0.05916200       -0.4402648
#> 266 -0.209521904        0.2504782   0.53869481       -0.4402648
#> 279 -1.470748221        0.2504782  -1.34184990       -0.4402648
#> 280  1.103183037        0.2504782  -1.63054603       -0.4402648
#> 297 -0.183782592        0.2504782  -0.13078788       -0.4402648
#> 306  1.437794101        0.2504782   1.05688610       -0.4402648
#> 309  0.923007849        0.2504782   0.27192496       -0.4402648
#> 318 -0.441175718        0.2504782  -0.10228370       -0.4402648
#> 322  0.279525035        0.2504782   0.29385125       -0.4402648
#> 325  1.103183037        0.2504782   1.34119698       -0.4402648
#> 329 -0.312479155       -4.2895324  -0.43263979       -0.4402648
#> 331 -0.569872281        0.2504782  -1.41493753       -0.4402648
#>     Credit_History_1 Property_Area_Rural Property_Area_Semiurban
#> 3          0.4402648          -0.6463485              -0.8175236
#> 4          0.4402648          -0.6463485              -0.8175236
#> 5          0.4402648          -0.6463485              -0.8175236
#> 7          0.4402648          -0.6463485              -0.8175236
#> 9          0.4402648          -0.6463485              -0.8175236
#> 10         0.4402648           1.5412926              -0.8175236
#> 11         0.4402648          -0.6463485              -0.8175236
#> 12         0.4402648          -0.6463485              -0.8175236
#> 17        -2.2627565           1.5412926              -0.8175236
#> 23         0.4402648           1.5412926              -0.8175236
#> 24         0.4402648          -0.6463485               1.2185729
#> 25         0.4402648          -0.6463485               1.2185729
#> 27         0.4402648          -0.6463485              -0.8175236
#> 48        -2.2627565          -0.6463485               1.2185729
#> 49         0.4402648          -0.6463485              -0.8175236
#> 53         0.4402648          -0.6463485              -0.8175236
#> 73        -2.2627565          -0.6463485              -0.8175236
#> 87         0.4402648           1.5412926              -0.8175236
#> 91         0.4402648          -0.6463485              -0.8175236
#> 95         0.4402648          -0.6463485              -0.8175236
#> 98         0.4402648           1.5412926              -0.8175236
#> 99         0.4402648           1.5412926              -0.8175236
#> 100        0.4402648          -0.6463485              -0.8175236
#> 108       -2.2627565          -0.6463485               1.2185729
#> 113        0.4402648           1.5412926              -0.8175236
#> 122        0.4402648          -0.6463485               1.2185729
#> 132        0.4402648          -0.6463485              -0.8175236
#> 136        0.4402648           1.5412926              -0.8175236
#> 147        0.4402648          -0.6463485              -0.8175236
#> 151        0.4402648          -0.6463485               1.2185729
#> 155        0.4402648           1.5412926              -0.8175236
#> 162        0.4402648          -0.6463485              -0.8175236
#> 163       -2.2627565          -0.6463485               1.2185729
#> 169        0.4402648          -0.6463485               1.2185729
#> 170        0.4402648          -0.6463485               1.2185729
#> 173        0.4402648           1.5412926              -0.8175236
#> 176        0.4402648           1.5412926              -0.8175236
#> 177        0.4402648          -0.6463485              -0.8175236
#> 189        0.4402648           1.5412926              -0.8175236
#> 192        0.4402648          -0.6463485               1.2185729
#> 194       -2.2627565          -0.6463485               1.2185729
#> 200        0.4402648          -0.6463485              -0.8175236
#> 201        0.4402648           1.5412926              -0.8175236
#> 211        0.4402648          -0.6463485               1.2185729
#> 219        0.4402648          -0.6463485               1.2185729
#> 220        0.4402648          -0.6463485              -0.8175236
#> 222        0.4402648          -0.6463485              -0.8175236
#> 225        0.4402648           1.5412926              -0.8175236
#> 229        0.4402648           1.5412926              -0.8175236
#> 233        0.4402648          -0.6463485               1.2185729
#> 248        0.4402648          -0.6463485               1.2185729
#> 249       -2.2627565          -0.6463485              -0.8175236
#> 254        0.4402648          -0.6463485               1.2185729
#> 256        0.4402648          -0.6463485              -0.8175236
#> 258        0.4402648          -0.6463485              -0.8175236
#> 260        0.4402648          -0.6463485              -0.8175236
#> 266        0.4402648           1.5412926              -0.8175236
#> 279        0.4402648          -0.6463485              -0.8175236
#> 280        0.4402648          -0.6463485              -0.8175236
#> 297        0.4402648           1.5412926              -0.8175236
#> 306        0.4402648          -0.6463485              -0.8175236
#> 309        0.4402648           1.5412926              -0.8175236
#> 318        0.4402648          -0.6463485               1.2185729
#> 322        0.4402648          -0.6463485               1.2185729
#> 325        0.4402648          -0.6463485               1.2185729
#> 329        0.4402648          -0.6463485               1.2185729
#> 331        0.4402648          -0.6463485               1.2185729
#>     Property_Area_Urban
#> 3             1.5137001
#> 4             1.5137001
#> 5             1.5137001
#> 7             1.5137001
#> 9             1.5137001
#> 10           -0.6581305
#> 11            1.5137001
#> 12            1.5137001
#> 17           -0.6581305
#> 23           -0.6581305
#> 24           -0.6581305
#> 25           -0.6581305
#> 27            1.5137001
#> 48           -0.6581305
#> 49            1.5137001
#> 53            1.5137001
#> 73            1.5137001
#> 87           -0.6581305
#> 91            1.5137001
#> 95            1.5137001
#> 98           -0.6581305
#> 99           -0.6581305
#> 100           1.5137001
#> 108          -0.6581305
#> 113          -0.6581305
#> 122          -0.6581305
#> 132           1.5137001
#> 136          -0.6581305
#> 147           1.5137001
#> 151          -0.6581305
#> 155          -0.6581305
#> 162           1.5137001
#> 163          -0.6581305
#> 169          -0.6581305
#> 170          -0.6581305
#> 173          -0.6581305
#> 176          -0.6581305
#> 177           1.5137001
#> 189          -0.6581305
#> 192          -0.6581305
#> 194          -0.6581305
#> 200           1.5137001
#> 201          -0.6581305
#> 211          -0.6581305
#> 219          -0.6581305
#> 220           1.5137001
#> 222           1.5137001
#> 225          -0.6581305
#> 229          -0.6581305
#> 233          -0.6581305
#> 248          -0.6581305
#> 249           1.5137001
#> 254          -0.6581305
#> 256           1.5137001
#> 258           1.5137001
#> 260           1.5137001
#> 266          -0.6581305
#> 279           1.5137001
#> 280           1.5137001
#> 297          -0.6581305
#> 306           1.5137001
#> 309          -0.6581305
#> 318          -0.6581305
#> 322          -0.6581305
#> 325          -0.6581305
#> 329          -0.6581305
#> 331          -0.6581305
#> attr(,"scaled:center")
#>              LoanAmount        Loan_Amount_Term            Total_Income 
#>             117.1401515             344.7727273            4830.9466666 
#>        Credit_History_0        Credit_History_1     Property_Area_Rural 
#>               0.1628788               0.8371212               0.2954545 
#> Property_Area_Semiurban     Property_Area_Urban 
#>               0.4015152               0.3030303 
#> attr(,"scaled:scale")
#>              LoanAmount        Loan_Amount_Term            Total_Income 
#>              38.8510764              60.7928091            1368.2205953 
#>        Credit_History_0        Credit_History_1     Property_Area_Rural 
#>               0.3699564               0.3699564               0.4571134 
#> Property_Area_Semiurban     Property_Area_Urban 
#>               0.4911359               0.4604411

Predict

Now, let’s choose a value for K. A common rule of thumb is K ≈ √n, where n is the number of training observations.

# rule of thumb: K ~ sqrt(number of training rows)
sqrt(nrow(train_x))
#> [1] 16.24808

K -> 16 (rounded from 16.25)
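Note that √n is only a heuristic, and an odd K is often preferred in binary classification to avoid voting ties. As a quick sanity check (a sketch using the objects already created above), we could compare the test accuracy for a few values of K around 16:

# a sketch: compare test accuracy for a few candidate K values
for (k in c(9, 15, 16, 17, 21)) {
  pred_k <- knn(train = train_x_scale, test = test_x_scale, cl = train_y, k = k)
  cat("k =", k, "-> accuracy:", round(mean(pred_k == test_y), 3), "\n")
}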

Next, let’s start predicting using the knn() function.

loan_pred <- knn(train = train_x_scale,
                   test = test_x_scale,
                   cl = train_y,
                   k = 16)

loan_pred
#>  [1] Y Y Y Y Y Y Y Y N Y Y Y Y N Y Y N Y Y Y Y Y Y N Y Y Y Y Y Y Y Y N Y Y Y Y Y
#> [39] Y Y N Y Y Y Y Y Y Y Y Y Y N Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
#> Levels: N Y

Model Evaluation

# confusion matrix

confusionMatrix(data = loan_pred,
                reference = test_y,
                positive = "Y")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  N  Y
#>          N  7  0
#>          Y 14 46
#>                                           
#>                Accuracy : 0.791           
#>                  95% CI : (0.6743, 0.8808)
#>     No Information Rate : 0.6866          
#>     P-Value [Acc > NIR] : 0.039911        
#>                                           
#>                   Kappa : 0.4071          
#>                                           
#>  Mcnemar's Test P-Value : 0.000512        
#>                                           
#>             Sensitivity : 1.0000          
#>             Specificity : 0.3333          
#>          Pos Pred Value : 0.7667          
#>          Neg Pred Value : 1.0000          
#>              Prevalence : 0.6866          
#>          Detection Rate : 0.6866          
#>    Detection Prevalence : 0.8955          
#>       Balanced Accuracy : 0.6667          
#>                                           
#>        'Positive' Class : Y               
#> 

Based on the confusion matrix above:

  • True Positives (TP): 46
  • False Positives (FP): 14
  • True Negatives (TN): 7
  • False Negatives (FN): 0

Now, let’s double check:

# Recall = TP/(TP+FN)
RecallKNN <- round(46 / (46 + 0),2)
# Specificity = TN/(TN+FP)
SpecificityKNN <- round(7 / (7 + 14), 2)
# Accuracy = (TP + TN) / TOTAL
AccuracyKNN <- round((46 + 7) / (46 + 14 + 7 + 0), 2)
# Precision = TP/(TP+FP)
PrecisionKNN <- round(46 / (46 + 14), 2)

performance <- cbind.data.frame(RecallKNN, SpecificityKNN, AccuracyKNN, PrecisionKNN)

performance

Let’s compare the two models (Logistic Regression vs. K-NN):

Comparison of Performance between Logistic Regression and KNN Models

  Comparison      Logistic Regression   KNN
  Recall          100%                  100%
  Specificity     33%                   33%
  Accuracy        79%                   79%
  Precision       77%                   77%

From the result, we can see that with the same set of predictors (those of model_optimum), logistic regression and K-NN happen to produce identical performance metrics on this test set.
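Identical summary metrics do not necessarily mean the two models agree on every individual case, so it is worth cross-tabulating their predictions directly (a quick check; since both splits used the same seed, the test rows line up):

# compare the two models' predictions case by case
table(logistic = loan_test$pred_label, knn = loan_pred)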

Conclusion

We can fairly conclude that both the Logistic Regression and K-NN models perform equally well when classifying loan eligibility for our Bank’s customers. However, we should take the models with a grain of salt, as the class balance was not ideal (roughly 1:2 rather than 1:1). This is likely the reason behind the low specificity score, i.e. the poor detection of unapproved loans. This warrants model improvement using techniques such as upSampling, downSampling, ROSE, or SMOTE, which will be discussed further in the next module.
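For reference, here is a minimal sketch (not run as part of this analysis) of one such fix: rebalancing the training classes with caret’s upSample() before refitting the models.

# a sketch (not run here): upsample the minority class in the training data
loan_train_up <- upSample(x = loan_train %>% select(-Loan_Status),
                          y = loan_train$Loan_Status,
                          yname = "Loan_Status")
table(loan_train_up$Loan_Status) # classes are now balanced 1:1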