Loan Prediction Using Logistic Regression and K-Nearest Neighbor
Introduction
In this case study, I examine loan application data to find indicators that predict whether a loan is approved or rejected (Loan Status). The goal is to automate the loan eligibility decision in real time from the details customers provide in the online application form, including Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, and Credit History.
The banking industry is a heavy user of analytical methods and data science. This data set gives me a feel for working with bank data, such as the challenges banks face and the strategies they employ. This is a classification problem. The data consists of 500 rows and 15 columns that can be used to predict whether or not a loan will be approved.
I will make predictions with two supervised learning algorithms: Logistic Regression and K-Nearest Neighbor.
Data Preparation
Import data:
loan <- read.csv("data_input/df1_loan.csv", stringsAsFactors = T)
loan
I’ll remove the $ symbol from the Total_Income column.
loan$Total_Income <- sub("([.])|[[:punct:]]", "\\1", as.matrix(loan$Total_Income))
loan$Total_Income <- as.numeric(loan$Total_Income)
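The call above uses sub(), which replaces only the first match, so it strips the leading $ and leaves the decimal point in place. A minimal sketch on made-up values (the vector `raw_income` is hypothetical and not taken from the data set) shows that behaviour next to a gsub() variant that would also cope with thousands separators, assuming such values could occur:

```r
# Hypothetical sample values; the real column is loan$Total_Income
raw_income <- c("$5849.0", "$6091.0", "$12,841.5")

# Same pattern as above: sub() only touches the first punctuation mark
first_only <- sub("([.])|[[:punct:]]", "\\1", raw_income)

# Assumed alternative: drop everything except digits and the decimal point
digits_only <- gsub("[^0-9.]", "", raw_income)

income <- as.numeric(digits_only)
income
```

The gsub() version is only worth the switch if the raw column can contain more than one non-numeric character per value.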
loan
Data Cleansing
Check the data types:
str(loan)
#> 'data.frame': 500 obs. of 15 variables:
#> $ X : int 0 1 2 3 4 5 6 7 8 9 ...
#> $ Loan_ID : Factor w/ 500 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
#> $ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#> $ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
#> $ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
#> $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
#> $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
#> $ ApplicantIncome : int 5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
#> $ CoapplicantIncome: num 0 1508 0 2358 0 ...
#> $ LoanAmount : num NA 128 66 120 141 267 95 158 168 349 ...
#> $ Loan_Amount_Term : num 360 360 360 360 360 360 360 360 360 360 ...
#> $ Credit_History : num 1 1 1 1 1 1 1 0 1 1 ...
#> $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
#> $ Loan_Status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
#> $ Total_Income : num 5849 6091 3000 4941 6000 ...
Change the data type and remove any unnecessary columns:
library(dplyr)
loan_clean <- loan %>%
  select(-c(X, Loan_ID)) %>%
  mutate(Credit_History = as.factor(Credit_History))
loan_clean
str(loan_clean)
#> 'data.frame': 500 obs. of 13 variables:
#> $ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#> $ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
#> $ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
#> $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
#> $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
#> $ ApplicantIncome : int 5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
#> $ CoapplicantIncome: num 0 1508 0 2358 0 ...
#> $ LoanAmount : num NA 128 66 120 141 267 95 158 168 349 ...
#> $ Loan_Amount_Term : num 360 360 360 360 360 360 360 360 360 360 ...
#> $ Credit_History : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
#> $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
#> $ Loan_Status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
#> $ Total_Income : num 5849 6091 3000 4941 6000 ...
Check for missing values:
colSums(is.na(loan_clean))
#> Gender Married Dependents Education
#> 0 0 0 0
#> Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
#> 0 0 0 18
#> Loan_Amount_Term Credit_History Property_Area Loan_Status
#> 14 41 0 0
#> Total_Income
#> 0
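Note that is.na() reports zero missing values for Gender, Married, Dependents, and Self_Employed, even though str() shows an empty-string level "" in each of them. A minimal sketch, on a toy data frame with hypothetical values, of how such blanks could be recoded to NA so that complete.cases() would catch them (this is an assumed extra step, not part of the analysis in this article):

```r
# Toy data with hypothetical values; "" plays the role of a blank form field
toy <- data.frame(
  Gender  = factor(c("Male", "", "Female")),
  Married = factor(c("Yes", "No", "")),
  Income  = c(5849, 4583, 3000)
)

colSums(is.na(toy))           # blanks are not counted as missing

# Recode "" to NA in every factor column, then drop the unused level
is_fac <- sapply(toy, is.factor)
toy[is_fac] <- lapply(toy[is_fac], function(x) {
  x[x == ""] <- NA
  droplevels(x)
})

colSums(is.na(toy))           # Gender and Married now show one NA each
toy_complete <- toy[complete.cases(toy), ]
```

Whether blank answers should be dropped, imputed, or kept as their own category is a modelling decision; the sketch only makes them visible.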
Remove any rows that have missing values:
loan_clean <- loan_clean %>%
  filter(complete.cases(.))
colSums(is.na(loan_clean))
#> Gender Married Dependents Education
#> 0 0 0 0
#> Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
#> 0 0 0 0
#> Loan_Amount_Term Credit_History Property_Area Loan_Status
#> 0 0 0 0
#> Total_Income
#> 0
Check the class proportions:
prop.table(table(loan_clean$Loan_Status))
#>
#> N Y
#> 0.3107477 0.6892523
Check the data summary:
summary(loan_clean)
#> Gender Married Dependents Education Self_Employed
#> : 8 : 2 : 9 Graduate :342 : 21
#> Female: 77 No :154 0 :245 Not Graduate: 86 No :352
#> Male :343 Yes:272 1 : 68 Yes: 55
#> 2 : 71
#> 3+: 35
#>
#> ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
#> Min. : 150 Min. : 0 Min. : 17.0 Min. : 36.0
#> 1st Qu.: 2880 1st Qu.: 0 1st Qu.:100.0 1st Qu.:360.0
#> Median : 3863 Median : 1062 Median :127.5 Median :360.0
#> Mean : 5627 Mean : 1503 Mean :144.0 Mean :342.8
#> 3rd Qu.: 5818 3rd Qu.: 2212 3rd Qu.:162.0 3rd Qu.:360.0
#> Max. :81000 Max. :20000 Max. :700.0 Max. :480.0
#> Credit_History Property_Area Loan_Status Total_Income
#> 0: 63 Rural :123 N:133 Min. : 1442
#> 1:365 Semiurban:167 Y:295 1st Qu.: 4166
#> Urban :138 Median : 5274
#> Mean : 7131
#> 3rd Qu.: 7544
#> Max. :81000
Logistic Regression
Cross Validation
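The purpose of cross-validation is to estimate how well the model performs on data it has not seen. Here this is done with a single hold-out split: 80% of the rows are used for training and the remaining 20% for testing.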
RNGkind(sample.kind = "Rounding")
set.seed(100)
index_loan <- sample(nrow(loan_clean), nrow(loan_clean)*0.8)
loan_train <- loan_clean[index_loan,]
loan_test <- loan_clean[-index_loan,]
Check the class proportions in the training data:
loan_train$Loan_Status %>%
  table() %>%
  prop.table()
#> .
#> N Y
#> 0.3157895 0.6842105
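The split above is a single 80/20 hold-out, so the evaluation depends on which rows happen to land in the test set. A k-fold scheme averages over several such splits. The following is a minimal sketch only, run on the built-in `mtcars` data rather than the loan data; the names `fold_id` and `acc` and the choice of five folds are illustrative assumptions:

```r
# Minimal k-fold sketch on a built-in data set (mtcars), not the loan data
set.seed(100)
k <- 5
dat <- mtcars
dat$am <- factor(dat$am)

# Randomly assign each row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = nrow(dat)))

# Fit on k - 1 folds, score on the held-out fold, repeat for every fold
acc <- sapply(seq_len(k), function(i) {
  train <- dat[fold_id != i, ]
  test  <- dat[fold_id == i, ]
  fit   <- glm(am ~ mpg, data = train, family = "binomial")
  prob  <- predict(fit, newdata = test, type = "response")
  pred  <- ifelse(prob > 0.5, "1", "0")
  mean(pred == as.character(test$am))
})

mean(acc)  # average held-out accuracy across the folds
```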
Build Model
Fit a model with all predictors, then reduce it with backward stepwise selection:
loan_risk <- glm(formula = Loan_Status ~ . , data = loan_train, family = "binomial")
# using the stepwise method
loan_risk_back <- step(object = loan_risk, direction="backward", trace = 0)
summary(loan_risk_back)
#>
#> Call:
#> glm(formula = Loan_Status ~ Loan_Amount_Term + Credit_History +
#> Property_Area, family = "binomial", data = loan_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.7052 -0.3290 0.5103 0.7926 2.4264
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.336430 1.042366 -1.282 0.199803
#> Loan_Amount_Term -0.005124 0.002707 -1.892 0.058435 .
#> Credit_History1 3.886297 0.561857 6.917 4.62e-12 ***
#> Property_AreaSemiurban 1.267555 0.359440 3.526 0.000421 ***
#> Property_AreaUrban 0.291414 0.333679 0.873 0.382480
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 426.58 on 341 degrees of freedom
#> Residual deviance: 314.98 on 337 degrees of freedom
#> AIC: 324.98
#>
#> Number of Fisher Scoring iterations: 5
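The coefficients above are on the log-odds scale, which is hard to read directly. As a rough sketch, they can be exponentiated into odds ratios and plugged into the inverse logit to get a predicted probability. The coefficient values are copied from the summary output; the applicant profile (360-month term, credit history 1, semiurban property) is a hypothetical example:

```r
# Coefficients copied from the summary output above
coefs <- c(
  "(Intercept)"            = -1.336430,
  "Loan_Amount_Term"       = -0.005124,
  "Credit_History1"        =  3.886297,
  "Property_AreaSemiurban" =  1.267555,
  "Property_AreaUrban"     =  0.291414
)

# Odds ratios: multiplicative change in the odds of approval (Loan_Status = Y)
odds_ratio <- exp(coefs)
round(odds_ratio, 2)

# Hypothetical applicant: 360-month term, credit history 1, semiurban property
log_odds <- coefs["(Intercept)"] +
  coefs["Loan_Amount_Term"] * 360 +
  coefs["Credit_History1"] +
  coefs["Property_AreaSemiurban"]

# Inverse logit turns log-odds into a probability (about 0.88 here)
prob_approved <- unname(plogis(log_odds))
prob_approved
```

Read this way, a positive credit history multiplies the odds of approval by roughly 49, holding the other two predictors fixed.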
Predict
Predict on the test data with the fitted model, using predict(model, newdata, type):
loan_pred <- predict(object = loan_risk_back, newdata = loan_test, type = "response")
loan_pred_label <- as.factor(ifelse(loan_pred > 0.5, "Y", "N"))
table(loan_pred_label)
#> loan_pred_label
#> N Y
#> 14 72
table(predict = loan_pred_label,
      actual = loan_test$Loan_Status)
#> actual
#> predict N Y
#> N 12 2
#> Y 13 59
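The 0.5 cutoff used to turn probabilities into labels is a choice, not a fixed rule. A small sketch with hypothetical probabilities and labels (none of these values come from the loan data, and `label_at` is an illustrative helper) shows how raising the cutoff trades wrongly approved cases for wrongly rejected ones:

```r
# Hypothetical predicted probabilities and true labels, for illustration only
prob   <- c(0.95, 0.80, 0.65, 0.55, 0.45, 0.30)
actual <- factor(c("Y", "Y", "N", "Y", "N", "N"), levels = c("N", "Y"))

# Illustrative helper: label "Y" when the probability exceeds the cutoff.
# Fixing the levels keeps both classes in the table even if one is never predicted.
label_at <- function(p, cutoff) {
  factor(ifelse(p > cutoff, "Y", "N"), levels = c("N", "Y"))
}

table(predict = label_at(prob, 0.5), actual = actual)
table(predict = label_at(prob, 0.7), actual = actual)
```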
Model Evaluation
library(caret)
CM_LR <- confusionMatrix(data = loan_pred_label, reference = loan_test$Loan_Status, positive = "Y")
CM_LR
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction N Y
#> N 12 2
#> Y 13 59
#>
#> Accuracy : 0.8256
#> 95% CI : (0.7287, 0.899)
#> No Information Rate : 0.7093
#> P-Value [Acc > NIR] : 0.009537
#>
#> Kappa : 0.5139
#>
#> Mcnemar's Test P-Value : 0.009823
#>
#> Sensitivity : 0.9672
#> Specificity : 0.4800
#> Pos Pred Value : 0.8194
#> Neg Pred Value : 0.8571
#> Prevalence : 0.7093
#> Detection Rate : 0.6860
#> Detection Prevalence : 0.8372
#> Balanced Accuracy : 0.7236
#>
#> 'Positive' Class : Y
#>
- Accuracy: the share of all predictions, across both classes, that are correct.
- Sensitivity/Recall: of the cases that are actually positive, the share the model predicts as positive.
- Specificity: of the cases that are actually negative, the share the model predicts as negative.
- Pos Pred Value/Precision: of the cases the model predicts as positive, the share that are actually positive.
Based on the confusion matrix, the model classifies Loan Status (approved or not) correctly 82.6% of the time. Of the applicants whose loans were actually not approved, the model identifies 48% correctly. Of the applicants whose loans were actually approved, it identifies 96.7% correctly. Of all the applicants the model predicts as approved, 81.9% were actually approved.
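The four metrics can be recomputed by hand from the counts in the logistic regression confusion matrix, which makes the definitions concrete. A minimal sketch, with the counts copied from the output above and "Y" treated as the positive class:

```r
# Counts taken from the logistic regression confusion matrix above
TP <- 59  # predicted Y, actual Y
FN <- 2   # predicted N, actual Y
FP <- 13  # predicted Y, actual N
TN <- 12  # predicted N, actual N

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # recall
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)   # Pos Pred Value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```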
K-nearest neighbor (K-NN)
Cross-Validation
set.seed(100)
index_loan <- sample(nrow(loan_clean), nrow(loan_clean)*0.8)
loan_train <- loan_clean[index_loan,]
loan_test <- loan_clean[-index_loan,]
Before we can scale, we must first separate predictors and targets:
# predictors, train data
train_x <- loan_train %>%
  select_if(is.numeric)
# target, train data
train_y <- loan_train %>%
  select(Loan_Status)
# predictors, test data
test_x <- loan_test %>%
  select_if(is.numeric)
# target, test data
test_y <- loan_test %>%
  select(Loan_Status)
Scale the train and test data, using the center and scale of the train data for both:
train_x_scaled <- scale(train_x)
test_x_scaled <- scale(test_x,
                       center = attr(train_x_scaled, "scaled:center"),
                       scale = attr(train_x_scaled, "scaled:scale"))
Choose k as the square root of the number of observations in the train data:
train_x %>%
  nrow() %>%
  sqrt()
#> [1] 18.49324
K = 18
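The square-root rule is only a starting point. With two classes, an even k such as 18 can produce tied votes, so a common refinement is to move to the nearest odd value. A small sketch of that adjustment; the odd-k step is an assumption added here, not something this analysis applies:

```r
n_train <- 342                  # rows in loan_train (80% of the 428 complete cases)
k <- floor(sqrt(n_train))       # 18, as used above

# Assumed refinement: with two classes an odd k cannot produce a tied vote
k_odd <- if (k %% 2 == 0) k + 1 else k
k_odd
```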
Use the knn() function from the class package to fit the model:
train: predictors from the train data
test: predictors from the test data
cl: target from the train data
k: the number of neighbors to use
library(class)
loan_pred <- knn(train = train_x_scaled,
                 test = test_x_scaled,
                 cl = train_y$Loan_Status,
                 k = 18)
loan_pred
#> [1] Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
#> [39] Y Y Y Y Y Y Y Y Y N Y Y Y Y Y Y Y Y Y Y Y Y N Y Y Y Y Y Y Y Y Y Y Y Y N Y N
#> [77] Y Y Y Y Y Y Y Y Y Y
#> Levels: N Y
Model Evaluation
library(caret)
CM_KNN <- confusionMatrix(data = loan_pred, reference = test_y$Loan_Status, positive = "Y")
CM_KNN
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction N Y
#> N 4 0
#> Y 21 61
#>
#> Accuracy : 0.7558
#> 95% CI : (0.6513, 0.842)
#> No Information Rate : 0.7093
#> P-Value [Acc > NIR] : 0.2044
#>
#> Kappa : 0.2127
#>
#> Mcnemar's Test P-Value : 1.275e-05
#>
#> Sensitivity : 1.0000
#> Specificity : 0.1600
#> Pos Pred Value : 0.7439
#> Neg Pred Value : 1.0000
#> Prevalence : 0.7093
#> Detection Rate : 0.7093
#> Detection Prevalence : 0.9535
#> Balanced Accuracy : 0.5800
#>
#> 'Positive' Class : Y
#>
- Accuracy: the share of all predictions, across both classes, that are correct.
- Sensitivity/Recall: of the cases that are actually positive, the share the model predicts as positive.
- Specificity: of the cases that are actually negative, the share the model predicts as negative.
- Pos Pred Value/Precision: of the cases the model predicts as positive, the share that are actually positive.
Based on the confusion matrix, the model classifies Loan Status (approved or not) correctly 75.6% of the time. Of the applicants whose loans were actually not approved, the model identifies only 16% correctly. Of the applicants whose loans were actually approved, it identifies 100% correctly. Of all the applicants the model predicts as approved, 74.4% were actually approved.
Conclusion
Which metric should guide the choice of model for predicting whether a prospective customer’s loan will be approved or denied? I’ll use recall, which measures how many of the actually approved customers the model identifies. The underlying concern is that approving customers who should have been rejected harms the company and adds risk.
Comparing the two models, K-Nearest Neighbor identifies actually approved customers better than Logistic Regression, with a recall of 100% against 96.7%. On recall alone, K-Nearest Neighbor is the preferred model. Recall does not measure wrongly approved applicants, however: K-Nearest Neighbor predicts approval for almost every applicant, so its specificity is only 16% against 48% for Logistic Regression, and its accuracy is lower (75.6% against 82.6%). Whether a model reduces the risk of lending therefore also depends on its specificity and precision, where Logistic Regression scores higher.