A Bank Manager is tasked by the Company with classifying whether the Bank's customers are eligible for a loan at the moment they first fill in their online application form. The goal is to target promising customers for further follow-up by the Bank on their loan application.
To do that, we will help the Bank Manager by building classification models using logistic regression and K-Nearest Neighbour (K-NN). The dataset that we use is derived from: https://www.kaggle.com/datasets/vipin20/loan-application-data.
# load all libraries for further usage
library(data.table)
library(dplyr)
library(class)
library(caret)
library(stringr)
library(ggplot2)
library(tidyr)
library(gtools)
library(fastDummies)

loan_class <- read.csv("datainput/df1_loan.csv", stringsAsFactors = T)
glimpse(loan_class)
#> Rows: 500
#> Columns: 15
#> $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15~
#> $ Loan_ID <fct> LP001002, LP001003, LP001005, LP001006, LP001008, LP~
#> $ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male~
#> $ Married <fct> No, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes,~
#> $ Dependents <fct> 0, 1, 0, 0, 0, 2, 0, 3+, 2, 1, 2, 2, 2, 0, 2, 0, 1, ~
#> $ Education <fct> Graduate, Graduate, Graduate, Not Graduate, Graduate~
#> $ Self_Employed <fct> No, No, Yes, No, No, Yes, No, No, No, No, No, , No, ~
#> $ ApplicantIncome <int> 5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006~
#> $ CoapplicantIncome <dbl> 0, 1508, 0, 2358, 0, 4196, 1516, 2504, 1526, 10968, ~
#> $ LoanAmount <dbl> NA, 128, 66, 120, 141, 267, 95, 158, 168, 349, 70, 1~
#> $ Loan_Amount_Term <dbl> 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 36~
#> $ Credit_History <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, NA, ~
#> $ Property_Area <fct> Urban, Rural, Urban, Urban, Urban, Urban, Urban, Sem~
#> $ Loan_Status <fct> Y, N, Y, Y, Y, Y, Y, N, Y, N, Y, Y, Y, N, Y, Y, Y, N~
#> $ Total_Income <fct> $5849.0, $6091.0, $3000.0, $4941.0, $6000.0, $9613.0~
It seems that:
- The Total_Income variable is not read as numeric because its values carry a dollar sign ($). Let's fix this by deleting the column and creating a new one that sums the ApplicantIncome and CoapplicantIncome variables, so that it holds numerical values (see the aside after this list for a direct-parsing alternative).
- Credit_History is still treated as numeric. Let's change it to a factor.
- Let's also remove unnecessary columns such as X and Loan_ID. We will additionally delete ApplicantIncome and CoapplicantIncome to avoid redundancy with Total_Income, which can be used as a proxy for income.
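As an aside, the dollar-formatted column could also be parsed directly rather than rebuilt; here is a minimal sketch, assuming the values only need the "$" and "," characters stripped (total_income_parsed is an illustrative vector, not used below).
# strip currency symbols and coerce the factor to numeric
total_income_parsed <- as.numeric(gsub("[$,]", "", loan_class$Total_Income))
head(total_income_parsed)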
# convert Credit_History from numeric to factor
loan_class$Credit_History <- as.factor(loan_class$Credit_History)
# drop X (col 1), Loan_ID (col 2) and the dollar-formatted Total_Income (col 15)
loan_class1 <- loan_class[-c(1:2,15)]
# rebuild Total_Income as a numeric sum of the two income columns
loan_class1$Total_Income <- rowSums(loan_class1[,c("ApplicantIncome", "CoapplicantIncome")])
# drop ApplicantIncome and CoapplicantIncome (cols 6:7) to avoid redundancy
loan_class2 <- loan_class1[-c(6:7)]
str(loan_class2)
#> 'data.frame': 500 obs. of 11 variables:
#> $ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#> $ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
#> $ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
#> $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
#> $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
#> $ LoanAmount : num NA 128 66 120 141 267 95 158 168 349 ...
#> $ Loan_Amount_Term: num 360 360 360 360 360 360 360 360 360 360 ...
#> $ Credit_History : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
#> $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
#> $ Loan_Status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
#> $ Total_Income : num 5849 6091 3000 4941 6000 ...
Based on the information that we got from the dataset, the variable descriptions are:
- Gender: gender of the applicant (Male or Female)
- Married: marital status (Yes or No)
- Dependents: number of dependents of the applicant
- Education: education level (Graduate or Not Graduate)
- Self_Employed: whether the applicant is self-employed (Yes or No)
- LoanAmount: requested loan amount
- Loan_Amount_Term: loan term (the length of time it will take for the loan to be completely paid off when the borrower is making regular payments)
- Credit_History: credit history (has a credit history = 1, no credit history = 0)
- Property_Area: area of the property (Rural, Semiurban or Urban)
- Loan_Status: loan approval (Y = loan approved, N = loan not approved)
- Total_Income: total income in the applicant's household

Next, we need to check whether our dataset has any missing values.
anyNA(loan_class2)
#> [1] TRUE
colSums(is.na(loan_class2))
#> Gender Married Dependents Education
#> 0 0 0 0
#> Self_Employed LoanAmount Loan_Amount_Term Credit_History
#> 0 18 14 41
#> Property_Area Loan_Status Total_Income
#> 0 0 0
Since the data has several missing values, let's drop the rows that contain them.
loan_class_clean <- na.omit(loan_class2)
anyNA(loan_class_clean)
#> [1] FALSE
colSums(is.na(loan_class_clean))
#> Gender Married Dependents Education
#> 0 0 0 0
#> Self_Employed LoanAmount Loan_Amount_Term Credit_History
#> 0 0 0 0
#> Property_Area Loan_Status Total_Income
#> 0 0 0
Now let's inspect the data for outliers, to make sure that our subsequent analysis is not biased by them.
summary(loan_class_clean)
#> Gender Married Dependents Education Self_Employed
#> : 8 : 2 : 9 Graduate :342 : 21
#> Female: 77 No :154 0 :245 Not Graduate: 86 No :352
#> Male :343 Yes:272 1 : 68 Yes: 55
#> 2 : 71
#> 3+: 35
#>
#> LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
#> Min. : 17.0 Min. : 36.0 0: 63 Rural :123 N:133
#> 1st Qu.:100.0 1st Qu.:360.0 1:365 Semiurban:167 Y:295
#> Median :127.5 Median :360.0 Urban :138
#> Mean :144.0 Mean :342.8
#> 3rd Qu.:162.0 3rd Qu.:360.0
#> Max. :700.0 Max. :480.0
#> Total_Income
#> Min. : 1442
#> 1st Qu.: 4166
#> Median : 5274
#> Mean : 7131
#> 3rd Qu.: 7544
#> Max. :81000
boxplot(loan_class_clean)

The boxplot above shows that the Total_Income variable has the most outliers. Let's confirm this with a dedicated boxplot and a histogram.
boxplot(loan_class_clean$Total_Income)
hist(loan_class_clean$Total_Income, breaks = 50)

To make sure that our analysis is not biased by these outliers, let's clean them up.
# filter out high-income outliers (a manual cutoff based on the histogram above)
loan_class_nooutlier <- loan_class_clean[loan_class_clean$Total_Income < 8000,]
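The 8000 cutoff was picked by eye; a more principled alternative is the classic 1.5 * IQR boxplot rule. A minimal sketch (the fixed cutoff above is the one kept for the rest of this analysis):
# upper fence of the boxplot rule: Q3 + 1.5 * IQR
q3 <- quantile(loan_class_clean$Total_Income, 0.75)
upper_fence <- q3 + 1.5 * IQR(loan_class_clean$Total_Income)
upper_fence # observations above this value would be flagged as outliers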
summary(loan_class_nooutlier)
#> Gender Married Dependents Education Self_Employed
#> : 4 : 2 : 8 Graduate :247 : 19
#> Female: 67 No :121 0 :197 Not Graduate: 84 No :276
#> Male :260 Yes:208 1 : 49 Yes: 36
#> 2 : 54
#> 3+: 23
#>
#> LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
#> Min. : 17.0 Min. : 36.0 0: 50 Rural : 96 N:102
#> 1st Qu.: 96.0 1st Qu.:360.0 1:281 Semiurban:127 Y:229
#> Median :118.0 Median :360.0 Urban :108
#> Mean :117.7 Mean :345.3
#> 3rd Qu.:137.5 3rd Qu.:360.0
#> Max. :275.0 Max. :480.0
#> Total_Income
#> Min. :1442
#> 1st Qu.:3791
#> Median :4727
#> Mean :4819
#> 3rd Qu.:5733
#> Max. :7978
Just to be safe, let's verify that we have really removed the outliers from our dataset.
boxplot(loan_class_nooutlier$Total_Income)
hist(loan_class_nooutlier$Total_Income, breaks = 50)

Based on the boxplot and histogram above, it's safe to say that the loan_class_nooutlier dataset is essentially free from outliers.
GREAT! Let's proceed with pre-processing our data.
Before we start the modelling, we need to check the proportion of the
target variable, which is Loan_Status
# Check proportion of target variable
prop.table(table(loan_class_nooutlier$Loan_Status)) %>% round(2)
#>
#> N Y
#> 0.31 0.69
table(loan_class_nooutlier$Loan_Status)
#>
#> N Y
#> 102 229
Since the class ratio is roughly 1:2, we will treat the data as balanced enough to proceed to the next step.
The next step is to split the data into train and test sets. The train data will be used for modelling, while the test data will be used to test our model on unseen data. The data will be split with a ratio of 80/20 (80% for the train data, 20% for the test data).
RNGkind(sample.kind = "Rounding")
set.seed(123)
index <- sample(x = nrow(loan_class_nooutlier),
size = nrow(loan_class_nooutlier)*0.8)
#splitting
loan_train <- loan_class_nooutlier[index,]
loan_test <- loan_class_nooutlier[-index,]

Just to be on the safe side, let's confirm whether loan_train is indeed 80% of loan_class_nooutlier.
nrow(loan_train)
#> [1] 264
nrow(loan_class_nooutlier)*0.8
#> [1] 264.8
Yep, the numbers match (264 is just 264.8 rounded down to a whole number of rows). Now, let's check the proportion of the target variable in the train data.
# recheck class balance
prop.table(table(loan_train$Loan_Status))
#>
#> N Y
#> 0.3068182 0.6931818
The class balance is similar to the loan_class_nooutlier dataset. That means it's good to go~!
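Note that a plain sample() does not guarantee identical class proportions in both splits; a stratified split would. Here is a minimal sketch using caret::createDataPartition() (caret is already loaded; loan_train_strat and loan_test_strat are illustrative names, not used below).
# createDataPartition samples within each Loan_Status class, so the
# train/test class proportions match the full data almost exactly
set.seed(123)
index_strat <- createDataPartition(loan_class_nooutlier$Loan_Status, p = 0.8, list = FALSE)
loan_train_strat <- loan_class_nooutlier[index_strat, ]
loan_test_strat <- loan_class_nooutlier[-index_strat, ]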
The logistic regression model is built using the glm() function. However, since we want to find the best-fitting model, we will also perform feature selection using stepwise regression in both directions (backward and forward). To run a both-directions stepwise search, we need to set the lower and upper bounds of the search space by first creating model_none and model_all.
model_none <- glm(formula = Loan_Status ~ 1, data = loan_train, family = binomial)
model_all <- glm(formula = Loan_Status~., data = loan_train, family = binomial)
summary(model_all)
#>
#> Call:
#> glm(formula = Loan_Status ~ ., family = binomial, data = loan_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.2305 -0.3414 0.3899 0.6598 2.3865
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.221e+01 1.018e+03 0.012 0.99044
#> GenderFemale 8.635e-01 1.466e+00 0.589 0.55579
#> GenderMale 1.219e+00 1.405e+00 0.868 0.38556
#> MarriedNo -1.319e+01 1.018e+03 -0.013 0.98967
#> MarriedYes -1.313e+01 1.018e+03 -0.013 0.98971
#> Dependents0 -1.097e+00 1.743e+00 -0.630 0.52898
#> Dependents1 -1.908e+00 1.791e+00 -1.065 0.28680
#> Dependents2 -1.019e+00 1.788e+00 -0.570 0.56863
#> Dependents3+ -1.046e+00 1.831e+00 -0.571 0.56770
#> EducationNot Graduate -5.373e-01 3.894e-01 -1.380 0.16761
#> Self_EmployedNo -2.984e-02 7.558e-01 -0.039 0.96851
#> Self_EmployedYes -2.702e-01 9.074e-01 -0.298 0.76587
#> LoanAmount -5.070e-03 5.537e-03 -0.916 0.35983
#> Loan_Amount_Term -6.502e-03 3.798e-03 -1.712 0.08688 .
#> Credit_History1 4.045e+00 6.107e-01 6.624 3.49e-11 ***
#> Property_AreaSemiurban 1.326e+00 4.579e-01 2.897 0.00377 **
#> Property_AreaUrban -3.592e-02 4.433e-01 -0.081 0.93542
#> Total_Income 2.243e-04 1.632e-04 1.375 0.16920
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 325.53 on 263 degrees of freedom
#> Residual deviance: 216.97 on 246 degrees of freedom
#> AIC: 252.97
#>
#> Number of Fisher Scoring iterations: 14
Now, let’s do the feature selection
model_step <- step(object = model_none,
direction = "both",
scope = list(upper = model_all),
trace = FALSE)
summary(model_step)
#>
#> Call:
#> glm(formula = Loan_Status ~ Credit_History + Property_Area +
#> Loan_Amount_Term, family = binomial, data = loan_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.2974 -0.3309 0.4516 0.7913 2.4455
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.940051 1.272418 -0.739 0.46003
#> Credit_History1 3.878285 0.580768 6.678 2.42e-11 ***
#> Property_AreaSemiurban 1.292314 0.427928 3.020 0.00253 **
#> Property_AreaUrban 0.061087 0.394078 0.155 0.87681
#> Loan_Amount_Term -0.005552 0.003510 -1.582 0.11375
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 325.53 on 263 degrees of freedom
#> Residual deviance: 227.52 on 259 degrees of freedom
#> AIC: 237.52
#>
#> Number of Fisher Scoring iterations: 5
Using stepwise regression, we found the selected model to be: > glm(formula = Loan_Status ~ Credit_History + Property_Area + Loan_Amount_Term, family = binomial, data = loan_train)
However, based on our understanding of the banking industry, Total_Income and LoanAmount may affect loan approval, because they speak to whether the customer will be able to repay the loan in the future. So we will create another model called model_optimum that adds LoanAmount and Total_Income to model_step.
model_optimum <- glm(formula = Loan_Status ~ Credit_History + Property_Area + Loan_Amount_Term + LoanAmount + Total_Income, family = binomial, data = loan_train)
summary(model_optimum)
#>
#> Call:
#> glm(formula = Loan_Status ~ Credit_History + Property_Area +
#> Loan_Amount_Term + LoanAmount + Total_Income, family = binomial,
#> data = loan_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.2945 -0.3506 0.4517 0.7518 2.4694
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.6853291 1.4628950 -1.152 0.249
#> Credit_History1 3.9014176 0.5808989 6.716 1.87e-11 ***
#> Property_AreaSemiurban 1.3438571 0.4349054 3.090 0.002 **
#> Property_AreaUrban 0.0783288 0.4055971 0.193 0.847
#> Loan_Amount_Term -0.0049702 0.0035436 -1.403 0.161
#> LoanAmount -0.0047831 0.0052063 -0.919 0.358
#> Total_Income 0.0002229 0.0001500 1.486 0.137
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 325.53 on 263 degrees of freedom
#> Residual deviance: 225.25 on 257 degrees of freedom
#> AIC: 239.25
#>
#> Number of Fisher Scoring iterations: 5
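To make these coefficients easier to read, we can exponentiate them into odds ratios; a quick sketch:
# e.g. exp(3.90) for Credit_History1 is roughly 49, meaning that having a
# credit history multiplies the odds of approval by about 49, all else equal
exp(coef(model_optimum)) %>% round(3)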
Using model_optimum, we will now try to predict on the test data that we set aside earlier.
loan_test$prob_approve <- predict(model_optimum, type = "response", newdata = loan_test)
head(loan_test)

Let's see the distribution of the predicted probabilities!
unique(loan_test$Loan_Status)
#> [1] Y N
#> Levels: N Y
Here the base/negative level is N (0) and the positive class is Y (1).
ggplot(loan_test, aes(x=prob_approve)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of the Probability of Loan Eligibility Prediction") +
  theme_minimal()

The plot above shows that the predicted probabilities are skewed towards 1, which means "approved".
Now let's turn the probabilities into class labels using the ifelse() function. The threshold that we will use is 0.5, which means:
- if the predicted probability > 0.5 -> Y / approved
- if the predicted probability <= 0.5 -> N / not approved
loan_test$pred_label <- ifelse(test = loan_test$prob_approve > 0.5,
yes = "Y",
no = "N")
loan_test

Let's evaluate the model using a confusion matrix. But first, let's make sure that pred_label is converted to a factor.
loan_test$pred_label <- as.factor(loan_test$pred_label)
str(loan_test$pred_label)
#> Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 1 2 ...
# confusion matrix
loan_model_evaluation <- confusionMatrix(data = loan_test$pred_label,
reference = loan_test$Loan_Status,
positive = "Y")
loan_model_evaluation
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction N Y
#> N 7 0
#> Y 14 46
#>
#> Accuracy : 0.791
#> 95% CI : (0.6743, 0.8808)
#> No Information Rate : 0.6866
#> P-Value [Acc > NIR] : 0.039911
#>
#> Kappa : 0.4071
#>
#> Mcnemar's Test P-Value : 0.000512
#>
#> Sensitivity : 1.0000
#> Specificity : 0.3333
#> Pos Pred Value : 0.7667
#> Neg Pred Value : 1.0000
#> Prevalence : 0.6866
#> Detection Rate : 0.6866
#> Detection Prevalence : 0.8955
#> Balanced Accuracy : 0.6667
#>
#> 'Positive' Class : Y
#>
Based on the above analysis:
- True Positive (TP): 46
- False Positive (FP): 14
- True Negative (TN): 7
- False Negative (FN): 0
Let's recall!
- Sensitivity / Recall: of all observations that are actually positive, the proportion our model predicts correctly. In our case, the sensitivity rate is 100%.
- Specificity: of all observations that are actually negative, the proportion our model predicts correctly. In our case, the specificity rate is 33.33%.
- Accuracy: the proportion of all predictions, across both classes, that are correct. Our accuracy rate is 79.1%.
- Precision / Pos Pred Value: of all positive predictions, the proportion that are truly positive. Our precision rate is 76.67%.
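These metrics all depend on the 0.5 cutoff. A threshold-free check is the ROC curve and its AUC; a minimal sketch, assuming the pROC package is installed (it is not among the libraries loaded above):
# the AUC summarises how well prob_approve separates Y from N across all cutoffs
library(pROC)
roc_obj <- roc(response = loan_test$Loan_Status, predictor = loan_test$prob_approve, levels = c("N", "Y"))
auc(roc_obj)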
Let’s double check whether the calculation above is already correct or not.
# Recall = TP/(TP+FN)
Recall <- round(46 / (46 + 0),2)
# Specificity = TN/(TN+FP)
Specificity <- round(7 / (7 + 14), 2)
# Accuracy = (TP+TN)/TOTAL
Accuracy <- round((46 + 7) / nrow(loan_test), 2)
# Precision = TP/(TP+FP)
Precision <- round(46 / (46 + 14), 2)
performance <- cbind.data.frame(Recall, Specificity, Accuracy, Precision)
performance

The numbers match, which means our confusion matrix analysis is valid.
From the result, we can conclude that our model predicts the target (loan approved or not approved) correctly around 79% of the time.
Of all the loans that were actually not approved, the model correctly predicts only 33%.
Meanwhile, of all the loans that were actually approved, the model correctly predicts 100%.
Of all the positive predictions that our model makes, around 77% are correct.
This means our model performs extremely well in terms of recall/sensitivity, reasonably well in accuracy and precision, but worst in specificity, i.e. in identifying non-approved loans.
Since K-NN works on numerical values, we first need to transform all of our categorical variables (apart from the target variable, Loan_Status) into numeric dummy (boolean) columns. We also need to drop the variables that are not used in model_optimum, so that the K-NN feature set mimics it.
# drop Gender, Married, Dependents, Education and Self_Employed (cols 1:5),
# which are not predictors in model_optimum
loan_KNN_nooutlier <- loan_class_nooutlier[-c(1:5)]
categorical = c('Credit_History', 'Property_Area')
results <- fastDummies::dummy_cols(loan_KNN_nooutlier, select_columns = categorical) # creating dummy columns
res <- results[,!(names(results) %in% categorical)] # deleting the original categorical columns
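The two steps above can also be combined into a single call; a sketch, assuming a fastDummies version recent enough to support the remove_selected_columns argument (res_alt is illustrative, not used below):
# one-step equivalent: create the dummies and drop the source columns
res_alt <- fastDummies::dummy_cols(loan_KNN_nooutlier,
                                   select_columns = categorical,
                                   remove_selected_columns = TRUE)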
res

Next, we need to separate the data into train and test datasets.
RNGkind(sample.kind = "Rounding")
set.seed(123)
indexKNN <- sample(x = nrow(res),
size = nrow(res)*0.8)
loanKNN_train <- res[indexKNN,]
loanKNN_test <- res[-indexKNN,]

Separate the predictors and the target!
# predictor for `train`
train_x <- loanKNN_train %>% select(-Loan_Status)
# predictor for `test`
test_x <- loanKNN_test %>% select(-Loan_Status)
# target for `train`
train_y <- loanKNN_train %>% pull(Loan_Status)
# target for `test`
test_y <- loanKNN_test %>% pull(Loan_Status)

Now, let's scale the data. The predictors will be scaled using z-score standardization. The test data must be scaled with the parameters (mean and standard deviation) of the train data, since the test data represents unseen data.
train_x_scale <- train_x %>%
scale()
test_x_scale <- test_x %>%
scale(center = attr(train_x_scale,"scaled:center"),
scale = attr(train_x_scale,"scaled:scale"))
head(test_x_scale)
#> LoanAmount Loan_Amount_Term Total_Income Credit_History_0
#> 3 0.073610534 0.2504782 0.08043537 -0.4402648
#> 4 0.614136098 0.2504782 0.85443337 -0.4402648
#> 5 -0.569872281 0.2504782 -0.71768154 -0.4402648
#> 7 1.309097538 0.2504782 0.51238326 -0.4402648
#> 9 -0.209521904 0.2504782 -0.35882128 -0.4402648
#> 10 -0.080825341 0.2504782 -0.10082195 -0.4402648
#> Credit_History_1 Property_Area_Rural Property_Area_Semiurban
#> 3 0.4402648 -0.6463485 -0.8175236
#> 4 0.4402648 -0.6463485 -0.8175236
#> 5 0.4402648 -0.6463485 -0.8175236
#> 7 0.4402648 -0.6463485 -0.8175236
#> 9 0.4402648 -0.6463485 -0.8175236
#> 10 0.4402648 1.5412926 -0.8175236
#> Property_Area_Urban
#> 3 1.5137001
#> 4 1.5137001
#> 5 1.5137001
#> 7 1.5137001
#> 9 1.5137001
#> 10 -0.6581305
Now, let's calculate the optimum K value. A common rule of thumb is the square root of the number of training rows.
# find optimum k
sqrt(nrow(train_x))
#> [1] 16.24808
Rounding this gives K = 16.
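The square-root rule is only a heuristic (and for a binary target an odd K is often preferred to avoid ties; strictly, K should be tuned on a validation set rather than the test set). A quick sweep over candidate values, sketched below with the objects already defined, is an easy sanity check:
# compare test-set accuracy across odd K values around the heuristic
k_grid <- seq(3, 25, by = 2)
accs <- sapply(k_grid, function(k) {
  pred <- knn(train = train_x_scale, test = test_x_scale, cl = train_y, k = k)
  mean(pred == test_y)
})
data.frame(k = k_grid, accuracy = round(accs, 3))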
Next, let’s start predicting using the knn()
function.
loan_pred <- knn(train = train_x_scale,
test = test_x_scale,
cl = train_y,
k = 16)
loan_pred
#> [1] Y Y Y Y Y Y Y Y N Y Y Y Y N Y Y N Y Y Y Y Y Y N Y Y Y Y Y Y Y Y N Y Y Y Y Y
#> [39] Y Y N Y Y Y Y Y Y Y Y Y Y N Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
#> Levels: N Y
# confusion matrix
confusionMatrix(data = loan_pred,
reference = test_y,
positive = "Y")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction N Y
#> N 7 0
#> Y 14 46
#>
#> Accuracy : 0.791
#> 95% CI : (0.6743, 0.8808)
#> No Information Rate : 0.6866
#> P-Value [Acc > NIR] : 0.039911
#>
#> Kappa : 0.4071
#>
#> Mcnemar's Test P-Value : 0.000512
#>
#> Sensitivity : 1.0000
#> Specificity : 0.3333
#> Pos Pred Value : 0.7667
#> Neg Pred Value : 1.0000
#> Prevalence : 0.6866
#> Detection Rate : 0.6866
#> Detection Prevalence : 0.8955
#> Balanced Accuracy : 0.6667
#>
#> 'Positive' Class : Y
#>
Based on the above analysis:
- True Positive (TP): 46
- False Positive (FP): 14
- True Negative (TN): 7
- False Negative (FN): 0
Now, let’s double check:
# Recall = TP/(TP+FN)
RecallKNN <- round(46 / (46 + 0),2)
# Specificity = TN/(TN+FP)
SpecificityKNN <- round(7 / (7 + 14), 2)
# Accuracy = (TP+TN)/TOTAL
AccuracyKNN <- round((46 + 7) / (46 + 14 + 7 + 0), 2)
# Precision = TP/(TP+FP)
PrecisionKNN <- round(46 / (46 + 14), 2)
performance <- cbind.data.frame(RecallKNN, SpecificityKNN, AccuracyKNN, PrecisionKNN)
performance

Let's compare the two models (Logistic Regression vs. K-NN):
| Comparison | Logistic Regression | KNN |
|---|---|---|
| Recall | 100% | 100% |
| Specificity | 33% | 33% |
| Accuracy | 79% | 79% |
| Precision | 77% | 77% |
From the result, we can see that when both methods use the same set of predictors (the model_optimum features), they happen to produce exactly the same predictions on this particular test set.
We can fairly assume that both the Logistic Regression and K-NN models perform equally well when classifying loan eligibility for our Bank's customers. However, we should take our models with a grain of salt, as the class balance was not ideal (not 1:1, but rather 1:2). This may be the reason behind the low specificity, i.e. the poor performance on unapproved loans. This warrants model improvement using upSampling, downSampling, ROSE or SMOTE, which will be further discussed in the next module.
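As a teaser for that module, here is a minimal sketch of the upsampling idea using caret's upSample() on the training data only (loan_train_up is an illustrative object, not used above):
# duplicate minority-class rows until both classes are the same size
loan_train_up <- upSample(x = loan_train %>% select(-Loan_Status),
                          y = loan_train$Loan_Status,
                          yname = "Loan_Status")
table(loan_train_up$Loan_Status)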