Bank Loan Approval
In Blog 3 I have decided to work on binary logistic regression. For this purpose I will use the open-source Bank Loan Approval data set, available at the link below:
https://www.kaggle.com/datasets/vikramamin/bank-loan-approval-lr-dt-rf-and-auc/data
Loan <- read.csv("https://raw.githubusercontent.com/uplotnik/Health/main/bankloan.csv")
head(Loan)
## ID Age Experience Income ZIP.Code Family CCAvg Education Mortgage
## 1 1 25 1 49 91107 4 1.6 1 0
## 2 2 45 19 34 90089 3 1.5 1 0
## 3 3 39 15 11 94720 1 1.0 1 0
## 4 4 35 9 100 94112 1 2.7 2 0
## 5 5 35 8 45 91330 4 1.0 2 0
## 6 6 37 13 29 92121 4 0.4 2 155
## Personal.Loan Securities.Account CD.Account Online CreditCard
## 1 0 1 0 0 0
## 2 0 1 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 1
## 6 0 0 0 1 0
ID : ID of the customer.
Age : Age of the customer in years.
Experience : Amount of work experience in years.
Income : Amount of annual income (in thousands).
ZIP.Code : Postal code of the city in which the client lives.
Family : Number of family members.
CCAvg : Average monthly spending with the credit card (in thousands).
Education : Education level
1 : Bachelor's degree
2 : Master's degree
3 : Advanced/Professional degree.
Mortgage: Value of home mortgage, if any (in thousands).
Securities Account : Does the customer have a securities account with the bank?
CD Account : Does the customer have a certificate of deposit account (CD) with the bank?
Online : Does the customer use the internet banking facilities?
CreditCard : Does the customer use a credit card issued by the bank?
Personal Loan : Did this customer accept the personal loan offered in the last campaign? (1 = yes, 0 = no; this is the response variable we will model.)
str(Loan)
## 'data.frame': 5000 obs. of 14 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 25 45 39 35 35 37 53 50 35 34 ...
## $ Experience : int 1 19 15 9 8 13 27 24 10 9 ...
## $ Income : int 49 34 11 100 45 29 72 22 81 180 ...
## $ ZIP.Code : int 91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
## $ Family : int 4 3 1 1 4 4 2 1 3 1 ...
## $ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
## $ Education : int 1 1 1 2 2 2 2 3 2 3 ...
## $ Mortgage : int 0 0 0 0 0 155 0 0 104 0 ...
## $ Personal.Loan : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Securities.Account: int 1 1 0 0 0 0 0 0 0 0 ...
## $ CD.Account : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Online : int 0 0 0 0 0 1 1 0 1 0 ...
## $ CreditCard : int 0 0 0 0 1 0 0 1 0 0 ...
library(ggplot2)
library(DataExplorer)
plot_intro(Loan,
ggtheme = theme_minimal(),
title = "Missing Data",
theme_config = theme(plot.title = element_text(color = "orange")),
geom_label_args = list(hjust = "inward")
)
# Count duplicate rows:
sum(duplicated(Loan))
## [1] 0
After checking the dataset for missing or duplicate values, we observe that it is clean and complete, giving us a reliable foundation for further analysis and modeling.
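For readers without DataExplorer, a minimal base-R sketch confirms the same thing (assuming only that Loan has been loaded as above):
# Base-R check: missing values per column and duplicate row count
colSums(is.na(Loan))   # should be all zeros
sum(duplicated(Loan))  # should be 0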
plot_bar(Loan)
From the visualization above we can see that the majority of the bank’s clients declined the offered personal loan, and only a small share accepted it.
We can also observe that most clients have neither a CD account nor a securities account, and the majority do not use the bank’s credit cards. However, a significant number of clients actively use the bank’s online banking services, so we can assume the bank offers a convenient and reliable online platform.
# Correlation of each variable with Personal.Loan
library(corrr)
cor_tbl <- Loan %>%
  correlate() %>%
  focus(Personal.Loan) %>%
  arrange(desc(Personal.Loan))
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
library(kableExtra)
knitr::kable(
  head(cor_tbl), caption = "Correlation Table") %>%
  kable_styling("striped", full_width = F)
| term | Personal.Loan |
|---|---|
| Income | 0.5024623 |
| CCAvg | 0.3668887 |
| CD.Account | 0.3163548 |
| Mortgage | 0.1420952 |
| Education | 0.1367216 |
| Family | 0.0613670 |
From the correlation table above we can distinguish strong and weak influences on personal loan decisions (the sketch below prints the full table, including the variables that barely correlate at all):
Strong influences: Income (0.50), CCAvg (0.37), and CD.Account (0.32) show the largest positive correlations with Personal.Loan.
Weaker influences: Mortgage (0.14), Education (0.14), and Family (0.06) are only mildly related.
Negative or near-zero correlations: the remaining variables (e.g., Age, Experience, ZIP.Code) rank below Family, so their linear relationship with Personal.Loan is negligible or negative.
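To see exactly which variables fall at the bottom of the ranking, we can print the full focused table rather than just its head (the same code as above, without head()):
knitr::kable(
  cor_tbl, caption = "Correlation Table (all variables)") %>%
  kable_styling("striped", full_width = F)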
## View bivariate continuous distributions by Personal.Loan
plot_boxplot(Loan, by = "Personal.Loan")
The boxplots reveal that the CCAvg, Mortgage, and Income columns contain a significant number of outliers.
plot_histogram(Loan)
We can observe that most of the columns are approximately normally distributed, except for Income, CCAvg, and Mortgage, which are right-skewed with many outliers.
The Age and Experience columns track each other closely: higher age values occur with roughly the same frequency as higher experience values, and vice versa.
The right-skewed distribution of the Income column indicates that the majority of the bank’s clients have relatively low incomes.
CCAvg (average credit card spending): the right-skewed distribution implies that most clients spend little on their credit cards, with zero being the most frequent value.
The right-skewed distribution of the Mortgage column suggests that the majority of clients do not hold mortgage loans. Most of the bank’s clients have bachelor’s degrees, and many appear to belong to single-member households.
Compute skewness:
library(moments)
##
## Attaching package: 'moments'
## The following object is masked from 'package:lessR':
##
## kurtosis
skewness(Loan$Income, na.rm = TRUE)
## [1] 0.8410862
skewness(Loan$CCAvg, na.rm = TRUE)
## [1] 1.597964
skewness(Loan$Mortgage, na.rm = TRUE)
## [1] 2.103371
# Candidate transformations for the right-skewed Income column
income <- Loan$Income
log_income <- log10(Loan$Income)
sqrt_income <- sqrt(Loan$Income)
recip_income <- 1/Loan$Income
inc_data <- data.frame(income, log_income, sqrt_income, recip_income)
plot_histogram(inc_data)
# CCAvg contains zeros, so offset by 1 before the log and reciprocal transforms
ccaverage <- Loan$CCAvg
log_ccaverage <- log10(Loan$CCAvg + 1)
sqrt_ccaverage <- sqrt(Loan$CCAvg)
recip_ccaverage <- 1/(Loan$CCAvg + 1)
ccaverage_data <- data.frame(ccaverage, log_ccaverage, sqrt_ccaverage, recip_ccaverage)
plot_histogram(ccaverage_data)
# Mortgage is mostly zeros, so offset by 1 here as well
mortgage <- Loan$Mortgage
log_mortgage <- log10(Loan$Mortgage + 1)
sqrt_mortgage <- sqrt(Loan$Mortgage)
recip_mortgage <- 1/(Loan$Mortgage + 1)
mortgage_data <- data.frame(mortgage, log_mortgage, sqrt_mortgage, recip_mortgage)
plot_histogram(mortgage_data)
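Before committing to a transformation, we can back the visual comparison with numbers by computing the skewness of every candidate column (a quick sketch using the data frames built above; values closest to zero indicate the most symmetric transform):
# Skewness of each candidate transformation (closer to 0 is better)
sapply(inc_data, skewness)
sapply(ccaverage_data, skewness)
sapply(mortgage_data, skewness)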
The square root transformation appears to be the most suitable method for the Income and CCAvg columns, while the reciprocal transformation yields better results for the Mortgage column.
Loan$Income <- sqrt_income
Loan$CCAvg <- sqrt_ccaverage
Loan$Mortgage <- recip_mortgage
We’ll standardize the numeric features so that each has zero mean and unit variance; putting the predictors on a common scale helps the model treat them comparably.
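Standardization converts each value to a z-score, z = (x - mean(x)) / sd(x); here is a one-column illustration of what the scale() call below computes:
# Manual z-score for Income, identical to scale(Loan$Income)
z_income <- (Loan$Income - mean(Loan$Income)) / sd(Loan$Income)
all.equal(z_income, as.vector(scale(Loan$Income)))  # should be TRUE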
library(data.table)
##
## Attaching package: 'data.table'
## The following object is masked from 'package:lessR':
##
## set
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following object is masked from 'package:purrr':
##
## transpose
## The following objects are masked from 'package:dplyr':
##
## between, first, last
setnames(Loan, "Personal.Loan", "loan")
setnames(Loan, "CD.Account", "CDAccount")
df3 <- Loan %>% mutate_at(c('Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education',
'Mortgage'), ~(scale(.) %>% as.vector))
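As an aside, mutate_at() is superseded in current dplyr; an equivalent using across() (assuming dplyr >= 1.0) would be:
df3 <- Loan %>%
  mutate(across(c(Age, Experience, Income, Family, CCAvg, Education, Mortgage),
                ~ as.vector(scale(.x))))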
Based on the analysis and correlation results above, the selected features are Income, CCAvg, Mortgage, Family, Education, CreditCard, CDAccount, Age, and Experience.
df = subset(df3, select = -c(ZIP.Code,Online,Securities.Account) )
# Use pre-R-3.6 sampling so the split is reproducible across R versions
RNGkind(sample.kind = "Rounding")
set.seed(100)
# 80/20 train/test split
index_loan <- sample(nrow(df), size = nrow(df)*0.8)
loan_train <- df[index_loan, ]
loan_test <- df[-index_loan, ]
dftrain <- data.frame(loan_train)
dftest <- data.frame(loan_test)
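Because only a small minority of clients accepted the loan, it is worth checking that the random split preserves the class balance (a quick check, not part of the original workflow):
# Share of accepted loans in each split
prop.table(table(dftrain$loan))
prop.table(table(dftest$loan))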
For our initial logistic regression model, we will focus on the Income variable, as it has the strongest correlation with loan acceptance.
# Training the model
logistic_model <- glm(loan ~ Income, family = 'binomial', data=dftrain)
# Checking the model
summary(logistic_model)
##
## Call:
## glm(formula = loan ~ Income, family = "binomial", data = dftrain)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.81541 0.13141 -29.04 <0.0000000000000002 ***
## Income 2.12680 0.09559 22.25 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2452.5 on 3999 degrees of freedom
## Residual deviance: 1552.3 on 3998 degrees of freedom
## AIC: 1556.3
##
## Number of Fisher Scoring iterations: 7
my_prediction_tr1 <- predict(logistic_model, newdata = dftrain, type = "response")
table(dftrain$loan, my_prediction_tr1 > 0.5)
##
## FALSE TRUE
## 0 3530 103
## 1 257 110
accuracy = (3530 + 110) / (3530+103+257+110)
accuracy
## [1] 0.91
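The same accuracy can be computed directly from the predictions, without hard-coding the cell counts (this reproduces the 0.91 above):
mean((my_prediction_tr1 > 0.5) == dftrain$loan)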
# Refitting the model on the test data
logistic_model2 <- glm(loan ~ Income, family = 'binomial', data=dftest)
# Checking the model
summary(logistic_model2)
##
## Call:
## glm(formula = loan ~ Income, family = "binomial", data = dftest)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.7386 0.2592 -14.42 <0.0000000000000002 ***
## Income 2.3164 0.1918 12.08 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 705.48 on 999 degrees of freedom
## Residual deviance: 405.34 on 998 degrees of freedom
## AIC: 409.34
##
## Number of Fisher Scoring iterations: 7
my_prediction_tr2 <- predict(logistic_model2, newdata = dftest, type = "response")
table(dftest$loan, my_prediction_tr2 > 0.5)
##
## FALSE TRUE
## 0 857 30
## 1 67 46
accuracy2 = (857 + 46) / (30+67+857 + 46)
accuracy2
## [1] 0.903
To evaluate the model’s accuracy, we have generated a confusion table for both the training and test data:
Train data: 91% accuracy
Test data: 90.3% accuracy
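Note that logistic_model2 above was refit on the test set, so its summary describes a second model rather than a true holdout evaluation. A sketch of the more conventional check applies the model trained on dftrain directly to dftest:
# Holdout evaluation: train-fitted model scored on the test set
holdout_pred <- predict(logistic_model, newdata = dftest, type = "response")
table(dftest$loan, holdout_pred > 0.5)
mean((holdout_pred > 0.5) == dftest$loan)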
To further improve the model, we can incorporate additional variables and assess their impact on accuracy.
# Training the model
logistic_model3 <- glm(loan ~ Income+CCAvg+CDAccount, family = 'binomial', data=dftrain)
# Checking the model
summary(logistic_model3)
##
## Call:
## glm(formula = loan ~ Income + CCAvg + CDAccount, family = "binomial",
## data = dftrain)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.14251 0.14623 -28.329 < 0.0000000000000002 ***
## Income 2.00133 0.11017 18.166 < 0.0000000000000002 ***
## CCAvg 0.17433 0.06741 2.586 0.00971 **
## CDAccount 2.41364 0.19963 12.091 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2452.5 on 3999 degrees of freedom
## Residual deviance: 1388.5 on 3996 degrees of freedom
## AIC: 1396.5
##
## Number of Fisher Scoring iterations: 7
my_prediction_tr3 <- predict(logistic_model3, newdata = dftrain, type = "response")
table(dftrain$loan, my_prediction_tr3 > 0.5)
##
## FALSE TRUE
## 0 3551 82
## 1 229 138
accuracy3 = (3551 + 138) / (3551+82+229+138)
accuracy3
## [1] 0.92225
# Refitting the model on the test data
logistic_model4 <- glm(loan ~ Income+CCAvg+CDAccount, family = 'binomial', data=dftest)
# Checking the model
summary(logistic_model4)
##
## Call:
## glm(formula = loan ~ Income + CCAvg + CDAccount, family = "binomial",
## data = dftest)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8392 0.2657 -14.449 < 0.0000000000000002 ***
## Income 2.1173 0.2098 10.094 < 0.0000000000000002 ***
## CCAvg 0.2037 0.1336 1.525 0.127354
## CDAccount 1.5254 0.3971 3.842 0.000122 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 705.48 on 999 degrees of freedom
## Residual deviance: 387.08 on 996 degrees of freedom
## AIC: 395.08
##
## Number of Fisher Scoring iterations: 7
my_prediction_tr4 <- predict(logistic_model4, newdata = dftest, type = "response")
table(dftest$loan, my_prediction_tr4 > 0.5)
##
## FALSE TRUE
## 0 859 28
## 1 58 55
accuracy4 = (859 + 55) / (859+28+58+55)
accuracy4
## [1] 0.914
Train data: 92.2%
Test data: 91.4%
We observe that incorporating additional variables has improved accuracy on the test set, highlighting the value of including relevant predictors when building a model meant to perform well on unseen data.
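The same caveat as before applies: logistic_model4 was refit on dftest. A stricter holdout check scores the train-fitted three-variable model on the test set:
holdout_pred3 <- predict(logistic_model3, newdata = dftest, type = "response")
table(dftest$loan, holdout_pred3 > 0.5)
mean((holdout_pred3 > 0.5) == dftest$loan)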
library(rpart)
# Grow a classification tree using the information (entropy) split criterion
dtree <- rpart(loan ~ Income + CCAvg + CDAccount, method = "class", data = dftrain,
               parms = list(split = "information"))
dtree$cptable
## CP nsplit rel error xerror xstd
## 1 0.10490463 0 1.0000000 1.0000000 0.04974732
## 2 0.01089918 2 0.7901907 0.7956403 0.04482964
## 3 0.01000000 4 0.7683924 0.7929155 0.04475884
plotcp(dtree)
# Prune at a cp between the first two CP values, which selects the 2-split tree
dtree.pruned <- prune(dtree, cp = .02290076)
library(rpart.plot)
# extra = 104 displays class probabilities and the percentage of observations per node
prp(dtree.pruned, type = 2, extra = 104,
    fallen.leaves = TRUE, main = "Decision Tree")
dtree.pred <- predict(dtree.pruned, dftrain, type="class")
dtree.perf <- table(dftrain$loan, dtree.pred,
dnn=c("Actual", "Predicted"))
dtree.perf
## Predicted
## Actual 0 1
## 0 3604 29
## 1 261 106
acc=(3604+106)/(3604+106+261+29)
acc
## [1] 0.9275
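Equivalently, the accuracy can be read straight off the confusion matrix:
# Sum of the diagonal (correct predictions) over all observations
sum(diag(dtree.perf)) / sum(dtree.perf)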
Next, we apply the same steps to the test data and generate a confusion table to assess the model’s accuracy.
dtree_test <- rpart(loan ~ Income + CCAvg + CDAccount, method = "class", data = dftest,
                    parms = list(split = "information"))
dtree_test$cptable
## CP nsplit rel error xerror xstd
## 1 0.07522124 0 1.0000000 1.0000000 0.08859773
## 2 0.05899705 2 0.8495575 0.9469027 0.08650459
## 3 0.01000000 5 0.6725664 0.8053097 0.08048671
plotcp(dtree_test)
# Again prune at a cp that selects the 2-split tree
dtree_test.pruned <- prune(dtree_test, cp = .01639344)
prp(dtree_test.pruned, type = 2, extra = 104,
fallen.leaves = TRUE, main="Decision Tree")
dtree_test.pred <- predict(dtree_test.pruned, dftest, type="class")
dtree_test.perf <- table(dftest$loan, dtree_test.pred,
dnn=c("Actual", "Predicted"))
dtree_test.perf
## Predicted
## Actual 0 1
## 0 869 18
## 1 58 55
acc4=(869+55)/(869+55+18+58)
acc4
## [1] 0.924
Accuracy:
Train data: 92.75%
Test data: 92.4%
We can observe improved performance compared to the logistic regression model, indicating more accurate prediction of personal loan approval status.
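As a final caveat, dtree_test was grown on the test set itself; a stricter holdout comparison applies the tree trained on dftrain to dftest:
dtree_holdout.pred <- predict(dtree.pruned, dftest, type = "class")
table(dftest$loan, dtree_holdout.pred, dnn = c("Actual", "Predicted"))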