Introduction
In the current world of high power data analytics and breadth of
borrower data available for utilization, ensuring efficient borrower
population segmentation and classification is central to the operating
processes of any effective lending institution. Financial institutions
and credit agencies must prioritize creation of accurate and detailed
borrower profiles to accurately predict borrower outcome to ensure
timely and complete repayment of loans as well as prevent fraudulent
borrowing that will result in a complete financial loss for the
institution.
Through detailed customer segmentation lending institutions and
credit agencies can streamline and improve their repayment rates and
overall profit by:
- Identifying borrower populations with greater repayment probability
for targeted reduced APR loan marketing
- Improving prediction accuracy to identify which customers are likely
to go into default and provide preemptive mitigation strategies such as
refinancing or adjusted repayment plans
- Develop better live spending limit adjustments to support borrower
needs and repayment capabilities
- Improve fraud detection to reduce the number of fraudulent
application approvals
Loan default data was obtained from
Applied
Analytics through Case Studies Using SAS and R, Deepti Gupta for
analysis. Each of the below outlined variables as included in the data
set serves as a categorical data point captured to help classify
borrowers and predict their repayment ability.
Variables
Variable Name
|
Variable Type
|
Details
|
Loan Status
|
Categorical
|
Status of bank loan default (Default vs Current)
|
Checking Amount
|
Numeric
|
Amount in borrower’s checking account
|
Term
|
Numeric
|
Loan term in months
|
Credit Score
|
Numeric
|
Borrower’s credit score
|
Gender
|
Categorical
|
Borrower’s gender
|
Marital Status
|
Categorical
|
Borrower’s marital status
|
Employment Status
|
Categorical
|
Borrower’s employment status
|
Amount
|
Numeric
|
Loan amount
|
Saving Amount
|
Numeric
|
Ammount in borrower’s saving account
|
Age
|
Numeric
|
Duration of borrower’s employment in months
|
Number of Credit Accounts
|
Numeric
|
Number of credit accounts in borrower’s name
|
Car Loan
|
Categorical
|
If borrower holds a car loan
|
Personal Loan
|
Categorical
|
If borrower holds a personal loan
|
Home Loan
|
Categorical
|
If borrower holds a home loan
|
Education Loan
|
Categorical
|
If borrower holds an education loan
|
Any Loan
|
Categorical
|
A feature variable measuring how many of the 4 defined loans held by the
borrower (personal, home, education, or car)
|
Total Debt
|
Numeric
|
A feature variable measuring the total number of borrower’s debts (4
defined loans and Number of Credit Accounts)
|
raw.loan <- read.csv("https://nlepera.github.io/sta551/HW04/data/BankLoanDefaultDataset.csv")
raw.loan$Any_loan <- rowSums(raw.loan[c(7:10)] == 1)
raw.loan$Total_debt <- as.numeric(raw.loan$Any_loan) + as.numeric(raw.loan$No_of_credit_acc)
raw.loan$Loan_status <- factor(raw.loan$Default, levels = c(0:1), labels = c("Current","Default"))
raw.loan$Car_loan <- factor(raw.loan$Car_loan, levels = c(0:1), labels = c("No", "Yes"))
raw.loan$Personal_loan <- factor(raw.loan$Personal_loan, levels = c(0:1), labels = c("No", "Yes"))
raw.loan$Home_loan <- factor(raw.loan$Home_loan, levels = c(0:1), labels = c("No", "Yes"))
raw.loan$Education_loan <- factor(raw.loan$Education_loan, levels = c(0:1), labels = c("No", "Yes"))
EDA & Feature
Variable Generation
In order to more efficiently utilize this data for borrower analysis,
the number of variables needs to be reduced in a process called
dimension reduction. The dimensions of this data set were reduced by
first combining the variables Car Loan
,
Personal Loan
, Home Loan
,
Education Loan
and Number of Credit Accounts
to create a new variable entitled Total Debt
that captures
the total count of loans and credit accounts for each borrower. This
effectively reduces five variables into a single variable, with minimal
loss of information.
loan <- raw.loan[-1] %>%
relocate(Loan_status, .before = Checking_amount) %>%
relocate(Amount, .after = Loan_status) %>%
relocate(Term, .after = Amount) %>%
relocate(Total_debt, .after = Term) %>%
relocate(No_of_credit_acc, .before = Car_loan)
kable(head(loan), caption = "A breif glimpse at the borrower data as collected" ) %>%
kable_styling(full_width = FALSE) %>%
scroll_box(width ="100%")
A breif glimpse at the borrower data as collected
Loan_status
|
Amount
|
Term
|
Total_debt
|
Checking_amount
|
Credit_score
|
Gender
|
Marital_status
|
No_of_credit_acc
|
Car_loan
|
Personal_loan
|
Home_loan
|
Education_loan
|
Emp_status
|
Saving_amount
|
Emp_duration
|
Age
|
Any_loan
|
Current
|
1536
|
15
|
2
|
988
|
796
|
Female
|
Single
|
1
|
Yes
|
No
|
No
|
No
|
employed
|
3455
|
12
|
38
|
1
|
Current
|
947
|
15
|
2
|
458
|
813
|
Female
|
Single
|
1
|
Yes
|
No
|
No
|
No
|
employed
|
3600
|
25
|
36
|
1
|
Current
|
1678
|
14
|
2
|
158
|
756
|
Female
|
Single
|
1
|
No
|
Yes
|
No
|
No
|
employed
|
3093
|
43
|
34
|
1
|
Default
|
1804
|
25
|
2
|
300
|
737
|
Female
|
Single
|
1
|
No
|
No
|
No
|
Yes
|
employed
|
2449
|
0
|
29
|
1
|
Default
|
1184
|
24
|
2
|
63
|
662
|
Female
|
Single
|
1
|
No
|
No
|
No
|
Yes
|
unemployed
|
2867
|
4
|
30
|
1
|
Current
|
475
|
20
|
3
|
1071
|
828
|
Male
|
Married
|
2
|
Yes
|
No
|
No
|
No
|
employed
|
3282
|
12
|
32
|
1
|
The distribution of the Total Debt
feature variable is
included for review.
par(mfrow = c(1,2))
hist(x = loan$Total_debt,
prob = TRUE,
main = "Distribution of
Borrower Total Debt",
xlab = "Total Number of Debt Accounts")
plot(x = loan$Loan_status,
y = loan$Total_debt,
col = c("skyblue", "darkred"),
main = "Total Debt by Loan Status
(Current vs Default)",
xlab = "Borrower Loan Default Status
(Current vs Default)",
ylab = "Total Number of Borrower Debt Accounts")

K Means Cluster
Analysis (Full Data)
A rudimentary cluster analysis of the numeric variables (Loan status,
Amount, Term, Checking amount, Credit score, Saving amount, Employment
Duration, Age, Total debt) provides an initial glimpse at the borrower
profile data, and underscores the need for further dimension reduction
prior to cluster analysis.
When assessing fo the optional cluster number utilizing the
silhouette method, it demonstrated that the optimal number of clusters
is two. Such a small number of clusters indicates a poor cluster model
fit and lacks predictive power. As demonstrated by the significant
overlap in the two clusters, a high level cluster analysis of all
numeric borrower profile data does not provide proper borrower
segmentation for identification and loan default prediction.
Utilizing such a poorly fit model for various business purposes could
result in ineffective targeted marketing, poor pre-approval models
resulting in higher back end costs reviewing and rejecting poorly fit
applicants, and overall poor risk analysis in initial borrower
requests.
loan.cluster <- loan[, c(1:6, 15:17)]
loan.cluster$Loan_status <- as.numeric(loan.cluster$Loan_status)
loan.cluster.group <- kmeans(x = loan.cluster, centers = 2)
loan.clustID <- loan.cluster.group$cluster
plot1 <- fviz_nbclust(loan.cluster, FUN = hcut, method = "wss")
plot2 <- fviz_nbclust(loan.cluster, FUN = hcut, method = "silhouette")
gridExtra::grid.arrange(plot1,plot2)

clusplot(loan.cluster,
loan.clustID,
lines = 0,
shade = T,
color = T,
labels = 1,
plotchar = F,
span = T)

Heirarchal Data
Clustering - Agglomerative
The following variables were analyzed to determine the borrower’s
repayment profile:
- Amount in borrower’s saving account
Saving_amount
- Borrower’s credit score
Credit_score
- Loan amount
Amount
- Status of bank loan default (Default vs Current)
Loan_status
Utilizing a ‘bottom-up’ approach the data was split into clusters
starting with single data points and successively merging clusters. This
allows for tuning of the cluster sizes without pre-determining cluster
count. Therefore, overall borrower profiles can be created from the data
without pre-determining segments allowing in improved borrower profiling
and prediction.
All data was run both raw and scaled. Raw data did not present
significant skew, and scaled data presented over fitting issues
resulting in a best fit of two clusters. No missing values were
identified in the data set, therefore no imputation was required.
heirarch.loans <- loan[, c("Checking_amount", "Amount", "Emp_duration", "Loan_status")]
heirarch.distance <- dist(heirarch.loans, method = "euclidean")
hcluster1 <- hclust(heirarch.distance, method = "complete")
Determining Cluster
Count
By examining the current borrower data to determine both the number
and dimensions of the clusters, the borrower profile clusters can be
created to better align with the current borrower profile base. This
will allow for improved accuracy in borrower outcome prediction over the
life cycle of the loan. For example, as a loan progresses, if a borrower
originally profiled as a high likelihood of repayment suddenly has a dip
in employment duration (i.e. laid off), accurate and timely reassessment
of the borrower profile will allow for rapid revocation of any pending
lending offers or credit increases, as well as flag the borrower for
targeted repayment and or refinancing programs to reduce risk of loan
delinquency
plot3 <- fviz_nbclust(heirarch.loans, FUN = hcut, method = "wss")
plot4 <- fviz_nbclust(heirarch.loans, FUN = hcut, method = "silhouette")
gridExtra::grid.arrange(plot3, plot4)

Final Cluster
analysis
Optimal clustering was indicated as 3 clusters via the silhouette
method and was selected as the pre-defined number of clusters for
hierarchical agglomerate cluster analysis. Unlike the clustering as seen
in the initial k-means cluster analysis, by utilizing hierarchical
agglomerate clustering, the clusters are created before the best fit
number of clusters is selected. This unsupervised algorithm allows for
the clustering to be determined directly from the data itself rather
than being pre-assigned based on assumptions.
This hierarchical clustering is visually represented in the below
dendrogram. The clusters as selected are highlighted visually with
different color square overlays. Inside the bounds of each box are the
borrowers falling into that cluster. A scatterplot is also included to
demonstrate the mapping of these clusters as assigned by the hclust()
function.
hcluster2 <- hcluster1
hCluster_group <- cutree(hcluster2, k = 3)
heirarch.loans$hCluster_group = hCluster_group
plot(hcluster2, cex = 0.6, labels = FALSE, hang = -1, xlab = "", main = "Dendogram of Borrower Profile Clustering")
par(lwd = 3)
rect.hclust(hcluster1, k = 3, border = 2:9)

par(lwd=1)
clusplot(heirarch.loans,
hCluster_group,
lines = 0,
shade = T,
color = T,
labels = 1,
plotchar = F,
span = T,
main = "Cluster Plot of Raw Borrower Data")

Principal Component
Analysis (PCA) & Reducing dimensions
To reduce, the dimensions must be scaled. All data transformed via
scale()
function as log scaling not possible with
employment duration as some duration = 0 months. (“Checking_amount”,
“Amount”, “Emp_duration”, “Loan_status”)
pca.loans.scale <- heirarch.loans[,-5]
pca.loans.scale$Checking_amount <- scale(pca.loans.scale$Checking_amount)
pca.loans.scale$Amount <- scale(pca.loans.scale$Amount)
pca.loans.scale$Emp_duration <- scale(pca.loans.scale$Emp_duration)
pca.loans.scale$Loan_status <- scale(as.numeric(pca.loans.scale$Loan_status))
pca.loans <- prcomp(pca.loans.scale, center = TRUE, scale = TRUE)
kable(round(pca.loans$rotation, 2), caption = "Factor loading of Borrower Profile PCA") %>%
kable_styling()
Factor loading of Borrower Profile PCA
|
PC1
|
PC2
|
PC3
|
PC4
|
Checking_amount
|
-0.65
|
-0.01
|
0.31
|
-0.69
|
Amount
|
0.30
|
0.61
|
0.73
|
0.04
|
Emp_duration
|
-0.21
|
0.79
|
-0.57
|
-0.07
|
Loan_status
|
0.67
|
-0.03
|
-0.20
|
-0.72
|
kable(round(summary(pca.loans)$importance, 3), caption = "Importance of each componant of Borrower Profile PCA") %>%
kable_styling()
Importance of each componant of Borrower Profile PCA
|
PC1
|
PC2
|
PC3
|
PC4
|
Standard deviation
|
1.243
|
1.009
|
0.949
|
0.733
|
Proportion of Variance
|
0.386
|
0.254
|
0.225
|
0.134
|
Cumulative Proportion
|
0.386
|
0.641
|
0.866
|
1.000
|
As demonstrated by the above PCA tables, the first three principal
components account for 86.6% of the variation in the borrower profile
data. The equation for each principal component is included below:
\[PC_1= -0.65[Checking.amount] +
0.30[Loan.amount] - 0.21[Employment.duration] +
0.67[Loan.status]\] \[PC_2=
-0.01[Checking.amount] + 0.61[Loan.amount] + 0.79[Employment.duration] -
0.03[Loan.status]\] \[PC_3=
0.31[Checking.amount] + 0.73[Loan.amount] - 0.57[Employment.duration] -
0.20[Loan.status]\]
PC1 most impacted by checking amount and loan status with light
impact from loan amount and employment duration, we will call this
measure Immediate Financial Status Index PC2 most impacted by loan
amount and employment duration with near 0 impact from checking amount
and loan status, we will call this measure Repayment Security Index
PC3 will be dropped ass PC2 and PC3 have same largest influence
variables (loan amount and employment duration), but PC2 has a greater
proportion of variance than PC3.
Determne Best Number
of Clusters
The same process was utilized to determine the best fit number of
clusters for the PCA variables, again outlining 3 clusters as the best
fit. Despite this best fit silhouette prediction, four clusters were
utilized to reduce overall cluster overlap.
pca.cluster <- data.frame(Imed_fin_stat_PC1 = pca.loans$x[,1],
Repay_sec_PC2 = pca.loans$x[,2])
plot5 <- fviz_nbclust(pca.cluster, FUN = hcut, method = "wss")
plot6 <- fviz_nbclust(pca.cluster, FUN = hcut, method = "silhouette")
gridExtra::grid.arrange(plot5, plot6)

pca.distance <- dist(pca.cluster, method = "euclidean")
pca.clust <- hclust(pca.distance, method = "complete")
pca.cluster.group <- cutree(pca.clust, 4)
heirarch.loans$pcaCluster = as.character(pca.cluster.group)
clusplot(pca.cluster,
pca.cluster.group,
lines = 0,
shade = T,
color = T,
labels = 1,
xlab = "Immediate Financial Status Index",
ylab = "Repayment Security Index",
plotchar = F,
span = T,
main="Cluster Plot of PCA Components")

In reviewing the principal component variables are found to improve
borrower profile predictive capabilities through dimension reduction. By
combining four variables into two component analysis variables, the data
can be more easily manipulated for borrower outcome predictions at the
time of lending and increased targeted marketing for customers indicated
to have a strong financial status index and repayment security index.
This allows for borrower segmentation into four clusters:
- High immediate financial status & High Repayment Security
- High Immediate Financial Status & Low Repayment Security
- Low Immediate Financial Status & High Repayment Security
- Low Immediate Financial Status & Low Repayment Security
These four clusters that comprise the borrower profile can allow for
a targeted customer based approach. Those with high financial status and
high repayment security can be targeted for low interest rate loans with
a higher down payment marketing to entice further borrowing from a low
risk borrower. Where as those with low immediate financial status &
high repayment security can be targeted for micro to medium sized loans
with minimal to no down payments to entice further borrowing from a low
risk borrower that may not have the funds for a traditional high down
payment loan.
Outlier analysis
Outlier identification can provide further support in both effective
targeted marketing, appropriate borrower application denials, and
fraudulent borrower application identification. Outliers to the borrower
profile groups should be manually reviewed to determine if the borrower
is at risk for adjusting one of their principal component
classifications (ex: moving from high immediate financial status to low,
etc), if the borrower data is potentially fraudulent, or if the borrower
data indicates a borrower that requires potential account termination
and/or future borrower application denial for lack of repayment
ability.
Additionally, properly identifying outliers allows for removal of
outliers prior to PCA or hierarchical agglomerate clustering will result
in increased borrower repayment profile classification and improved
predictive capacity.
Local Outlier Factor
(LOF) Score
A local outlier factor (LOF) score is calculated utilizing the below
equation to compare a data point’s local reachability density (LRD) of
the nearest k neighbors to point \(A_i\) for \(i = 1
, 2, ... , n\).
\[
LOF(A_i) = \frac{\frac{\sum_{i = A_j \in N_kA_i}
{LRD_k(A_j)}}{||N_k(A_i)||}} {LRD_k(A_i)}
\]
The LOF factor acts as an easily filterable scale variable to quickly
identify outlier values. An LOF > 1 indicates a potential outlier,
with the greater values for LOF indiciating more extreme outliers.
lof.pca.cluster <- lof(pca.cluster, minPts = 50)
pca.cluster$LOF <- lof.pca.cluster
pander::pander(summary(lof.pca.cluster), caption = "Summary statistics of LOF scores for Immediate Financial Status Index & Repayment Security Index")
Summary statistics of LOF scores for Immediate Financial Status
Index & Repayment Security Index
0.9571 |
0.9939 |
1.017 |
1.064 |
1.079 |
2.315 |
Based on the summary statistics for the calculated LOF scores for the
PCA feature variables the outlier cutoff was selected as an LOF value of
1.8. This value was selected as the cutoff to ensure less than 1% of the
dataset is identified as an outlier, leading to an outlier flagging of 1
in every 100 borrowers. Selecting an LOF value of 1.7 resulted in >
1% outlier flagging. This 1% outlier flagging will scale well to the
lender or credit agency’s operational constraints until the need for
hyperparameter (k) tuning is required.
plot(x = pca.cluster$Imed_fin_stat_PC1,
y = pca.cluster$Repay_sec_PC2,
pch = "x",
cex = 0.5,
xlab = "Immediate Financial Status Index",
ylab = "Repayment Security Index",
main="Outlier Identification Amongst PCA Components")
points(pca.cluster,
cex = ((lof.pca.cluster - 1)*1.5),
pch = 21,
col = "hotpink")
text(pca.cluster[lof.pca.cluster > 1.8,],
labels = round(lof.pca.cluster, 1)[lof.pca.cluster >1.8],
pos = 2,
cex = 0.6,
col = "darkred")

kable(filter(pca.cluster, LOF > 1.8), caption = "listing of all LOF scores > 1.8") %>%
kable_styling()
listing of all LOF scores > 1.8
Imed_fin_stat_PC1
|
Repay_sec_PC2
|
LOF
|
-2.481105
|
-2.2764407
|
2.009484
|
3.396802
|
0.6592382
|
1.882003
|
-3.034711
|
-0.0338831
|
1.888966
|
-2.282020
|
-2.8414007
|
2.315474
|
2.926977
|
2.3990462
|
1.973623
|
Conclusions
Overall proper borrower profile segmentation will allow for improved
loan default prediction models, improved identification of fraudulent
pre-approval applications, and improved targeted marketing to drive up
borrowing rates from borrowers with a high repayment profile. Overall
this borrower segmentation and classification may also be used for
predictive analysis regarding borrower pre-approval determinations.
Overall borrower population segmentation remains a highly effective tool
for managing and predicting overall loan outcomes.
---
title: "Can you pay me back? :<img src=\"https://nlepera.github.io/sta551/HW01/img/penguin_cute.png\" style=\"float: right; width: 12%\"/>"
subtitle: "Loan Default Borrower Data Clustering & Analysis"
author:
- name: Natalie LePera
  affiliation: West Chester University | STA551 - HW 04
date: "12 Dec 2024"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    toc_collapse: yes
    number_sections: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: true
    theme: readable
    fig_align: center
    df_print: kable
---

```{css, echo = FALSE}
h1.title {  /* Title - font specifications of the report title */
  font-weight:bold;
  color: darkmagenta ;
}
h1.subtitle {  /* Title - font specifications of the report title */
  font-weight:bold;
  color: darkmagenta ;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-family: system-ui;
  color: navy;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-family: system-ui;
  color: navy;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-weight:bold;
    color: navy;
    text-align: left;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-weight:bold;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-weight:bold;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    color: darkred;
    text-align: left;
}

body {
  background-color:white;
}

.highlightme { 
  background-color:yellow; 
}

p { 
  background-color:white; 
}

h5 {
  color: navy;
}

.iframe {
  text-align: center;
}

a:link {
  color: darkmagenta;
}

.figlabel {
  text-align: center;
  color: darkslategray;
  font-weight: bold;
  font-size: 18;
}

.td1 {
  font-weight: bold;
}

th, td {
  border-bottom: 1px solid #ddd;
  text-align: left;
}

tr:hover {background-color: coral;}
```

```{r setup, include=FALSE}
if (!require("dplyr")) {
    install.packages("dplyr")              
    library("dplyr")
}

if (!require("plyr")) {
    install.packages("plyr")              
    library("plyr")
}

if (!require("stringr")) {
    install.packages("stringr")              
    library("stringr")
}

if (!require("plotly")) {
    install.packages("plotly")              
    library("plotly")
}

if (!require("pandoc")) {
    install.packages("pandoc")              
    library("pandoc")
}

if (!require("gridExtra")) {
    install.packages("gridExtra")              
    library("gridExtra")
}

if (!require("grid")) {
    install.packages("grid")              
    library("grid")
}
if (!require("raster")) {
    install.packages("raster")              
    library("raster")
}
if (!require("dbscan")) {
    install.packages("dbscan")              
    library("dbscan")
}
if (!require("pROC")) {
    install.packages("pROC")              
    library("pROC")
}
if (!require("ggridges")) {
    install.packages("ggridges")              
    library("ggridges")
}
if (!require("knitr")) {
    install.packages("knitr")              
    library("knitr")
}
if (!require("GGally")) {
    install.packages("GGally")              
    library("GGally")
}
if (!require("ggplot2")) {
    install.packages("ggplot2")              
    library("gglpot2")
}
if (!require("cluster")) {
    install.packages("cluster")              
    library("cluster")
}
if (!require("kableExtra")) {
    install.packages("kableExtra", dependencies = TRUE)              
    library("kableExtra")
}
if (!require("forcats")) {
    install.packages("forcats", dependencies = TRUE)              
    library("forcats")
}
if (!require("rpart")) {
    install.packages("rpart", dependencies = TRUE)              
    library("rpart")
}
if (!require("rpart.plot")) {
    install.packages("rpart.plot", dependencies = TRUE)              
    library("rpart.plot")
}
if (!require("metan")) {
    install.packages("metan", dependencies = TRUE)              
    library("metan")
}
 if (!require("factoextra")) {
   install.packages("factoextra", dependencies = TRUE)
   library("factoextra")
 }

knitr::opts_chunk$set(echo = TRUE,       
                      warning = FALSE,   
                      result = TRUE,   
                      message = FALSE,
                      comment = NA,
                      fig.align = 'center')

options(DT.options = list(pageLength = 5, scrollX = TRUE))
```


# Introduction

In the current world of high power data analytics and breadth of borrower data available for utilization, ensuring efficient borrower population segmentation and classification is central to the operating processes of any effective lending institution. Financial institutions and credit agencies must prioritize creation of accurate and detailed borrower profiles to accurately predict borrower outcome to ensure timely and complete repayment of loans as well as prevent fraudulent borrowing that will result in a complete financial loss for the institution.   


Through detailed customer segmentation lending institutions and credit agencies can streamline and improve their repayment rates and overall profit by:

  - Identifying borrower populations with greater repayment probability for targeted reduced APR loan marketing
  - Improving prediction accuracy to identify which customers are likely to go into default and provide preemptive mitigation strategies such as refinancing or adjusted repayment plans
  - Develop better live spending limit adjustments to support borrower needs and repayment capabilities
  - Improve fraud detection to reduce the number of fraudulent application approvals

Loan default data was obtained from <a href="https://pengdsci.github.io/datasets/LoanData2/BankLoanDefaultDataset.csv">Applied Analytics through Case Studies Using SAS and R, Deepti Gupta</a> for analysis. Each of the below outlined variables as included in the data set serves as a categorical data point captured to help classify borrowers and predict their repayment ability. 

<h5>Variables</h5>

<table style="width:100%">
<thead><tr>
<th>Variable Name</th>
<th>Variable Type</th>
<th>Details</th>
</tr></thead>
<tr><td class = "td1">Loan Status</td><td>Categorical</td><td>Status of bank loan default (Default vs Current)</td></tr>
<tr><td class = "td1">Checking Amount</td><td>Numeric</td><td>Amount in borrower's checking account</td>
<tr><td class = "td1">Term</td><td>Numeric</td><td>Loan term in months</td></tr>
<tr><td class = "td1">Credit Score</td><td>Numeric</td><td>Borrower's credit score</td></tr>
<tr><td class = "td1">Gender</td><td>Categorical</td><td>Borrower's gender</td></tr>
<tr><td class = "td1">Marital Status</td><td>Categorical</td><td>Borrower's marital status</td></tr>
<tr><td class = "td1">Employment Status</td><td>Categorical</td><td>Borrower's employment status</td></tr>
<tr><td class = "td1">Amount</td><td>Numeric</td><td>Loan amount</td></tr>
<tr><td class = "td1">Saving Amount</td><td>Numeric</td><td>Ammount in borrower's saving account</td></tr>
<tr><td class = "td1">Age</td><td>Numeric</td><td>Duration of borrower's employment in months</td></tr>
<tr><td class = "td1">Number of Credit Accounts</td><td>Numeric</td><td>Number of credit accounts in borrower's name</td></tr>
<tr><td class = "td1">Car Loan</td><td>Categorical</td><td>If borrower holds a car loan</td></tr>
<tr><td class = "td1">Personal Loan</td><td>Categorical</td><td>If borrower holds a personal loan</td></tr>
<tr><td class = "td1">Home Loan</td><td>Categorical</td><td>If borrower holds a home loan</td></tr>
<tr><td class = "td1">Education Loan</td><td>Categorical</td><td>If borrower holds an education loan</td></tr>
<tr><td class = "td1">Any Loan</td><td>Categorical</td><td>A feature variable measuring how many of the 4 defined loans held by the borrower (personal, home, education, or car)</td></tr>
<tr><td class = "td1">Total Debt</td><td>Numeric</td><td>A feature variable measuring the total number of borrower's debts (4 defined loans and Number of Credit Accounts)</td></tr>
</table>

```{r}
raw.loan <- read.csv("https://nlepera.github.io/sta551/HW04/data/BankLoanDefaultDataset.csv")
raw.loan$Any_loan <- rowSums(raw.loan[c(7:10)] == 1)
raw.loan$Total_debt <- as.numeric(raw.loan$Any_loan) + as.numeric(raw.loan$No_of_credit_acc)
raw.loan$Loan_status <- factor(raw.loan$Default, levels = c(0:1), labels = c("Current","Default"))
raw.loan$Car_loan <- factor(raw.loan$Car_loan, levels = c(0:1), labels = c("No", "Yes"))
raw.loan$Personal_loan <- factor(raw.loan$Personal_loan, levels = c(0:1), labels = c("No", "Yes"))
raw.loan$Home_loan <- factor(raw.loan$Home_loan, levels = c(0:1), labels = c("No", "Yes"))
raw.loan$Education_loan <- factor(raw.loan$Education_loan, levels = c(0:1), labels = c("No", "Yes"))

```

## EDA & Feature Variable Generation

In order to more efficiently utilize this data for borrower analysis, the number of variables needs to be reduced in a process called dimension reduction.  The dimensions of this data set were reduced by first combining the variables `Car Loan`, `Personal Loan`, `Home Loan`, `Education Loan` and `Number of Credit Accounts` to create a new variable entitled `Total Debt` that captures the total count of loans and credit accounts for each borrower.  This effectively reduces five variables into a single variable, with minimal loss of information. 

```{r}
loan <- raw.loan[-1] %>% 
  relocate(Loan_status, .before = Checking_amount) %>%
  relocate(Amount, .after = Loan_status) %>% 
  relocate(Term, .after = Amount) %>% 
  relocate(Total_debt, .after = Term) %>% 
  relocate(No_of_credit_acc, .before = Car_loan)

kable(head(loan), caption = "A breif glimpse at the borrower data as collected" ) %>% 
  kable_styling(full_width = FALSE) %>% 
  scroll_box(width ="100%")
```


The distribution of the `Total Debt` feature variable is included for review.

```{r, fig.width=9}

par(mfrow = c(1,2))
hist(x = loan$Total_debt, 
     prob = TRUE,
     main = "Distribution of 
Borrower Total Debt",
     xlab = "Total Number of Debt Accounts")

plot(x = loan$Loan_status,
     y = loan$Total_debt,
     col = c("skyblue", "darkred"),
     main = "Total Debt by Loan Status
(Current vs Default)",
     xlab = "Borrower Loan Default Status
(Current vs Default)",
     ylab = "Total Number of Borrower Debt Accounts")
```




# K Means Cluster Analysis (Full Data)

A rudimentary cluster analysis of the numeric variables (Loan status, Amount, Term, Checking amount, Credit score, Saving amount, Employment Duration, Age, Total debt) provides an initial glimpse at the borrower profile data, and underscores the need for further dimension reduction prior to cluster analysis.  

When assessing fo the optional cluster number utilizing the silhouette method, it demonstrated that the optimal number of clusters is two.  Such a small number of clusters indicates a poor cluster model fit and lacks predictive power.  As demonstrated by the significant overlap in the two clusters, a high level cluster analysis of all numeric borrower profile data does not provide proper borrower segmentation for identification and loan default prediction.

Utilizing such a poorly fit model for various business purposes could result in ineffective targeted marketing, poor pre-approval models resulting in higher back end costs reviewing and rejecting poorly fit applicants, and overall poor risk analysis in initial borrower requests. 


```{r, warning = FALSE, fig.width=9, fig.height=9}
loan.cluster <- loan[, c(1:6, 15:17)]
loan.cluster$Loan_status <- as.numeric(loan.cluster$Loan_status)
loan.cluster.group <- kmeans(x = loan.cluster, centers = 2)
loan.clustID <- loan.cluster.group$cluster


plot1 <- fviz_nbclust(loan.cluster, FUN = hcut, method = "wss")
plot2 <- fviz_nbclust(loan.cluster, FUN = hcut, method = "silhouette")


gridExtra::grid.arrange(plot1,plot2)

clusplot(loan.cluster,
         loan.clustID,
         lines = 0,
         shade = T,
         color = T,
         labels = 1,
         plotchar = F,
         span = T)
```



# Heirarchal Data Clustering - Agglomerative

The following variables were analyzed to determine the borrower's repayment profile:

  - Amount in borrower's saving account `Saving_amount`
  - Borrower's credit score `Credit_score`
  - Loan amount `Amount`
  - Status of bank loan default (Default vs Current) `Loan_status`
  

Utilizing a 'bottom-up' approach the data was split into clusters starting with single data points and successively merging clusters. This allows for tuning of the cluster sizes without pre-determining cluster count.  Therefore, overall borrower profiles can be created from the data without pre-determining segments allowing in improved borrower profiling and prediction.


All data was run both raw and scaled.  Raw data did not present significant skew, and scaled data presented over fitting issues resulting in a best fit of two clusters. No missing values were identified in the data set, therefore no imputation was required. 

```{r}
heirarch.loans <- loan[, c("Checking_amount", "Amount", "Emp_duration", "Loan_status")]
heirarch.distance <- dist(heirarch.loans, method = "euclidean")
hcluster1 <- hclust(heirarch.distance, method = "complete")
```

## Determining Cluster Count

By examining the current borrower data to determine both the number and dimensions of the clusters, the borrower profile clusters can be created to better align with the current borrower profile base.  This will allow for improved accuracy in borrower outcome prediction over the life cycle of the loan.  For example, as a loan progresses, if a borrower originally profiled as a high likelihood of repayment suddenly has a dip in employment duration (i.e. laid off), accurate and timely reassessment of the borrower profile will allow for rapid revocation of any pending lending offers or credit increases, as well as flag the borrower for targeted repayment and or refinancing programs to reduce risk of loan delinquency 

```{r, fig.height=9, fig.width=7}
plot3 <- fviz_nbclust(heirarch.loans, FUN = hcut, method = "wss")
plot4 <- fviz_nbclust(heirarch.loans, FUN = hcut, method = "silhouette")
gridExtra::grid.arrange(plot3, plot4)
```



## Final Cluster analysis

Optimal clustering was indicated as 3 clusters via the silhouette method and was selected as the pre-defined number of clusters for hierarchical agglomerate cluster analysis.  Unlike the clustering as seen in the initial k-means cluster analysis, by utilizing hierarchical agglomerate clustering, the clusters are created before the best fit number of clusters is selected.  This unsupervised algorithm allows for the clustering to be determined directly from the data itself rather than being pre-assigned based on assumptions.

This hierarchical clustering is visually represented in the below dendrogram.  The clusters as selected are highlighted visually with different color square overlays. Inside the bounds of each box are the borrowers falling into that cluster.  A scatterplot is also included to demonstrate the mapping of these clusters as assigned by the hclust() function. 

```{r, fig.width=9}

hcluster2 <- hcluster1
hCluster_group <- cutree(hcluster2, k = 3)
heirarch.loans$hCluster_group = hCluster_group

plot(hcluster2, cex = 0.6, labels = FALSE, hang = -1, xlab = "", main = "Dendogram of Borrower Profile Clustering")
par(lwd = 3)
rect.hclust(hcluster1, k = 3, border = 2:9)

par(lwd=1)
clusplot(heirarch.loans,
         hCluster_group,
         lines = 0,
         shade = T,
         color = T,
         labels = 1,
         plotchar = F,
         span = T,
         main = "Cluster Plot of Raw Borrower Data")


```




# Principal Component Analysis (PCA) & Reducing dimensions

To reduce, the dimensions must be scaled. All data transformed via `scale()` function as log scaling not possible with employment duration as some duration = 0 months. ("Checking_amount", "Amount", "Emp_duration", "Loan_status")

```{r}
pca.loans.scale <- heirarch.loans[,-5]

pca.loans.scale$Checking_amount <- scale(pca.loans.scale$Checking_amount)
pca.loans.scale$Amount <- scale(pca.loans.scale$Amount)
pca.loans.scale$Emp_duration <- scale(pca.loans.scale$Emp_duration)
pca.loans.scale$Loan_status <- scale(as.numeric(pca.loans.scale$Loan_status))


pca.loans <- prcomp(pca.loans.scale, center = TRUE, scale = TRUE)


kable(round(pca.loans$rotation, 2), caption = "Factor loading of Borrower Profile PCA") %>% 
  kable_styling()
kable(round(summary(pca.loans)$importance, 3), caption = "Importance of each componant of Borrower Profile PCA") %>% 
  kable_styling()
```


As demonstrated by the above PCA tables, the first three principal components account for 86.6% of the variation in the borrower profile data.  The equation for each principal component is included below: 

$$PC_1= -0.65[Checking.amount] + 0.30[Loan.amount] - 0.21[Employment.duration] + 0.67[Loan.status]$$
$$PC_2= -0.01[Checking.amount] + 0.61[Loan.amount] + 0.79[Employment.duration] - 0.03[Loan.status]$$
$$PC_3= 0.31[Checking.amount] + 0.73[Loan.amount] - 0.57[Employment.duration] - 0.20[Loan.status]$$

PC1 most impacted by checking amount and loan status with light impact from loan amount and employment duration, we will call this measure Immediate Financial Status Index
PC2 most impacted by loan amount and employment duration with near 0 impact from checking amount and loan status, we will call this measure Repayment Security Index 

PC3 will be dropped ass PC2 and PC3 have same largest influence variables (loan amount and employment duration), but PC2 has a greater proportion of variance than PC3. 

## Determne Best Number of Clusters

The same process was utilized to determine the best fit number of clusters for the PCA variables, again outlining 3 clusters as the best fit.  Despite this best fit silhouette prediction, four clusters were utilized to reduce overall cluster overlap. 

```{r}

pca.cluster <- data.frame(Imed_fin_stat_PC1 = pca.loans$x[,1],
                          Repay_sec_PC2 = pca.loans$x[,2])

plot5 <- fviz_nbclust(pca.cluster, FUN = hcut, method = "wss")
plot6 <- fviz_nbclust(pca.cluster, FUN = hcut, method = "silhouette")
gridExtra::grid.arrange(plot5, plot6)

```

```{r}


pca.distance <- dist(pca.cluster, method = "euclidean")
pca.clust <- hclust(pca.distance, method = "complete")
pca.cluster.group <- cutree(pca.clust, 4)

heirarch.loans$pcaCluster = as.character(pca.cluster.group)

clusplot(pca.cluster,
         pca.cluster.group,
         lines = 0,
         shade = T,
         color = T,
         labels = 1,
         xlab = "Immediate Financial Status Index",
         ylab = "Repayment Security Index",
         plotchar = F,
         span = T,
         main="Cluster Plot of PCA Components")

```

In reviewing the principal component variables are found to improve borrower profile predictive capabilities through dimension reduction.  By combining four variables into two component analysis variables, the data can be more easily manipulated for borrower outcome predictions at the time of lending and increased targeted marketing for customers indicated to have a strong financial status index and repayment security index.   This allows for borrower segmentation into four clusters:

  - High immediate financial status & High Repayment Security
  - High Immediate Financial Status & Low Repayment Security
  - Low Immediate Financial Status & High Repayment Security
  - Low Immediate Financial Status & Low Repayment Security
  

These four clusters that comprise the borrower profile can allow for a targeted customer based approach.  Those with high financial status and high repayment security can be targeted for low interest rate loans with a higher down payment marketing to entice further borrowing from a low risk borrower.  Where as those with low immediate financial status & high repayment security can be targeted for micro to medium sized loans with minimal to no down payments to entice further borrowing from a low risk borrower that may not have the funds for a traditional high down payment loan. 

# Outlier analysis

Outlier identification can provide further support in both effective targeted marketing, appropriate borrower application denials, and fraudulent borrower application identification. Outliers to the borrower profile groups should be manually reviewed to determine if the borrower is at risk for adjusting one of their principal component classifications (ex: moving from high immediate financial status to low, etc), if the borrower data is potentially fraudulent, or if the borrower data indicates a borrower that requires potential account termination and/or future borrower application denial for lack of repayment ability.  

Additionally, properly identifying outliers allows for removal of outliers prior to PCA or hierarchical agglomerate clustering will result in increased borrower repayment profile classification and improved predictive capacity.

## Local Outlier Factor (LOF) Score

A local outlier factor (LOF) score is calculated utilizing the below equation to compare a data point's local reachability density (LRD) of the nearest k neighbors to point $A_i$ for $i = 1 , 2, ... , n$. 

$$
LOF(A_i) = \frac{\frac{\sum_{i = A_j \in N_kA_i} {LRD_k(A_j)}}{||N_k(A_i)||}} {LRD_k(A_i)}
$$

The LOF factor acts as an easily filterable scale variable to quickly identify outlier values.  An LOF > 1 indicates a potential outlier, with the greater values for LOF indiciating more extreme outliers. 

```{r}

lof.pca.cluster <- lof(pca.cluster, minPts = 50)
pca.cluster$LOF <- lof.pca.cluster

pander::pander(summary(lof.pca.cluster), caption = "Summary statistics of LOF scores for Immediate Financial Status Index & Repayment Security Index")
```

Based on the summary statistics for the calculated LOF scores for the PCA feature variables the outlier cutoff was selected as an LOF value of 1.8.  This value was selected as the cutoff to ensure less than 1% of the dataset is identified as an outlier, leading to an outlier flagging of 1 in every 100 borrowers. Selecting an LOF value of 1.7 resulted in > 1% outlier flagging. This 1% outlier flagging will scale well to the lender or credit agency's operational constraints until the need for hyperparameter (k) tuning is required. 

```{r}
plot(x = pca.cluster$Imed_fin_stat_PC1,
     y = pca.cluster$Repay_sec_PC2,
     pch = "x",
     cex = 0.5,
     xlab = "Immediate Financial Status Index",
     ylab = "Repayment Security Index",
     main="Outlier Identification Amongst PCA Components")
points(pca.cluster,
       cex = ((lof.pca.cluster - 1)*1.5),
       pch = 21,
       col = "hotpink")
text(pca.cluster[lof.pca.cluster > 1.8,],
     labels = round(lof.pca.cluster, 1)[lof.pca.cluster >1.8],
     pos = 2,
     cex = 0.6,
     col = "darkred")

kable(filter(pca.cluster, LOF > 1.8), caption = "listing of all LOF scores > 1.8") %>% 
  kable_styling()
```

# Conclusions

Overall proper borrower profile segmentation will allow for improved loan default prediction models, improved identification of fraudulent pre-approval applications, and improved targeted marketing to drive up borrowing rates from borrowers with a high repayment profile.  Overall this borrower segmentation and classification may also be used for predictive analysis regarding borrower pre-approval determinations.  Overall borrower population segmentation remains a highly effective tool for managing and predicting overall loan outcomes.


# References

Data source:

Applied Analytics through Case Studies Using SAS and R, Deepti Gupta by APress, ISBN - 978-1-4842-3525-6
Accessed via: <a href="https://pengdsci.github.io/datasets/LoanData2/BankLoanDefaultDataset.csv">https://pengdsci.github.io/datasets/LoanData2/BankLoanDefaultDataset.csv</a>