1 Introduction

In the current world of high power data analytics and breadth of borrower data available for utilization, ensuring efficient borrower population segmentation and classification is central to the operating processes of any effective lending institution. Financial institutions and credit agencies must prioritize creation of accurate and detailed borrower profiles to accurately predict borrower outcome to ensure timely and complete repayment of loans as well as prevent fraudulent borrowing that will result in a complete financial loss for the institution.

Through detailed customer segmentation lending institutions and credit agencies can streamline and improve their repayment rates and overall profit by:

  • Identifying borrower populations with greater repayment probability for targeted reduced APR loan marketing
  • Improving prediction accuracy to identify which customers are likely to go into default and provide preemptive mitigation strategies such as refinancing or adjusted repayment plans
  • Develop better live spending limit adjustments to support borrower needs and repayment capabilities
  • Improve fraud detection to reduce the number of fraudulent application approvals

Loan default data was obtained from Applied Analytics through Case Studies Using SAS and R, Deepti Gupta for analysis. Each of the below outlined variables as included in the data set serves as a categorical data point captured to help classify borrowers and predict their repayment ability.

Variables
Variable Name Variable Type Details
Loan Status Categorical Status of bank loan default (Default vs Current)
Checking Amount Numeric Amount in borrower’s checking account
Term Numeric Loan term in months
Credit Score Numeric Borrower’s credit score
Gender Categorical Borrower’s gender
Marital Status Categorical Borrower’s marital status
Employment Status Categorical Borrower’s employment status
Amount Numeric Loan amount
Saving Amount Numeric Ammount in borrower’s saving account
Age Numeric Duration of borrower’s employment in months
Number of Credit Accounts Numeric Number of credit accounts in borrower’s name
Car Loan Categorical If borrower holds a car loan
Personal Loan Categorical If borrower holds a personal loan
Home Loan Categorical If borrower holds a home loan
Education Loan Categorical If borrower holds an education loan
Any Loan Categorical A feature variable measuring how many of the 4 defined loans held by the borrower (personal, home, education, or car)
Total Debt Numeric A feature variable measuring the total number of borrower’s debts (4 defined loans and Number of Credit Accounts)
raw.loan <- read.csv("https://nlepera.github.io/sta551/HW04/data/BankLoanDefaultDataset.csv")
raw.loan$Any_loan <- rowSums(raw.loan[c(7:10)] == 1)
raw.loan$Total_debt <- as.numeric(raw.loan$Any_loan) + as.numeric(raw.loan$No_of_credit_acc)
raw.loan$Loan_status <- factor(raw.loan$Default, levels = c(0:1), labels = c("Current","Default"))
raw.loan$Car_loan <- factor(raw.loan$Car_loan, levels = c(0:1), labels = c("No", "Yes"))
raw.loan$Personal_loan <- factor(raw.loan$Personal_loan, levels = c(0:1), labels = c("No", "Yes"))
raw.loan$Home_loan <- factor(raw.loan$Home_loan, levels = c(0:1), labels = c("No", "Yes"))
raw.loan$Education_loan <- factor(raw.loan$Education_loan, levels = c(0:1), labels = c("No", "Yes"))

1.1 EDA & Feature Variable Generation

In order to more efficiently utilize this data for borrower analysis, the number of variables needs to be reduced in a process called dimension reduction. The dimensions of this data set were reduced by first combining the variables Car Loan, Personal Loan, Home Loan, Education Loan and Number of Credit Accounts to create a new variable entitled Total Debt that captures the total count of loans and credit accounts for each borrower. This effectively reduces five variables into a single variable, with minimal loss of information.

loan <- raw.loan[-1] %>% 
  relocate(Loan_status, .before = Checking_amount) %>%
  relocate(Amount, .after = Loan_status) %>% 
  relocate(Term, .after = Amount) %>% 
  relocate(Total_debt, .after = Term) %>% 
  relocate(No_of_credit_acc, .before = Car_loan)

kable(head(loan), caption = "A breif glimpse at the borrower data as collected" ) %>% 
  kable_styling(full_width = FALSE) %>% 
  scroll_box(width ="100%")
A breif glimpse at the borrower data as collected
Loan_status Amount Term Total_debt Checking_amount Credit_score Gender Marital_status No_of_credit_acc Car_loan Personal_loan Home_loan Education_loan Emp_status Saving_amount Emp_duration Age Any_loan
Current 1536 15 2 988 796 Female Single 1 Yes No No No employed 3455 12 38 1
Current 947 15 2 458 813 Female Single 1 Yes No No No employed 3600 25 36 1
Current 1678 14 2 158 756 Female Single 1 No Yes No No employed 3093 43 34 1
Default 1804 25 2 300 737 Female Single 1 No No No Yes employed 2449 0 29 1
Default 1184 24 2 63 662 Female Single 1 No No No Yes unemployed 2867 4 30 1
Current 475 20 3 1071 828 Male Married 2 Yes No No No employed 3282 12 32 1

The distribution of the Total Debt feature variable is included for review.

par(mfrow = c(1,2))
hist(x = loan$Total_debt, 
     prob = TRUE,
     main = "Distribution of 
Borrower Total Debt",
     xlab = "Total Number of Debt Accounts")

plot(x = loan$Loan_status,
     y = loan$Total_debt,
     col = c("skyblue", "darkred"),
     main = "Total Debt by Loan Status
(Current vs Default)",
     xlab = "Borrower Loan Default Status
(Current vs Default)",
     ylab = "Total Number of Borrower Debt Accounts")

2 K Means Cluster Analysis (Full Data)

A rudimentary cluster analysis of the numeric variables (Loan status, Amount, Term, Checking amount, Credit score, Saving amount, Employment Duration, Age, Total debt) provides an initial glimpse at the borrower profile data, and underscores the need for further dimension reduction prior to cluster analysis.

When assessing fo the optional cluster number utilizing the silhouette method, it demonstrated that the optimal number of clusters is two. Such a small number of clusters indicates a poor cluster model fit and lacks predictive power. As demonstrated by the significant overlap in the two clusters, a high level cluster analysis of all numeric borrower profile data does not provide proper borrower segmentation for identification and loan default prediction.

Utilizing such a poorly fit model for various business purposes could result in ineffective targeted marketing, poor pre-approval models resulting in higher back end costs reviewing and rejecting poorly fit applicants, and overall poor risk analysis in initial borrower requests.

loan.cluster <- loan[, c(1:6, 15:17)]
loan.cluster$Loan_status <- as.numeric(loan.cluster$Loan_status)
loan.cluster.group <- kmeans(x = loan.cluster, centers = 2)
loan.clustID <- loan.cluster.group$cluster


plot1 <- fviz_nbclust(loan.cluster, FUN = hcut, method = "wss")
plot2 <- fviz_nbclust(loan.cluster, FUN = hcut, method = "silhouette")


gridExtra::grid.arrange(plot1,plot2)

clusplot(loan.cluster,
         loan.clustID,
         lines = 0,
         shade = T,
         color = T,
         labels = 1,
         plotchar = F,
         span = T)

3 Heirarchal Data Clustering - Agglomerative

The following variables were analyzed to determine the borrower’s repayment profile:

  • Amount in borrower’s saving account Saving_amount
  • Borrower’s credit score Credit_score
  • Loan amount Amount
  • Status of bank loan default (Default vs Current) Loan_status

Utilizing a ‘bottom-up’ approach the data was split into clusters starting with single data points and successively merging clusters. This allows for tuning of the cluster sizes without pre-determining cluster count. Therefore, overall borrower profiles can be created from the data without pre-determining segments allowing in improved borrower profiling and prediction.

All data was run both raw and scaled. Raw data did not present significant skew, and scaled data presented over fitting issues resulting in a best fit of two clusters. No missing values were identified in the data set, therefore no imputation was required.

heirarch.loans <- loan[, c("Checking_amount", "Amount", "Emp_duration", "Loan_status")]
heirarch.distance <- dist(heirarch.loans, method = "euclidean")
hcluster1 <- hclust(heirarch.distance, method = "complete")

3.1 Determining Cluster Count

By examining the current borrower data to determine both the number and dimensions of the clusters, the borrower profile clusters can be created to better align with the current borrower profile base. This will allow for improved accuracy in borrower outcome prediction over the life cycle of the loan. For example, as a loan progresses, if a borrower originally profiled as a high likelihood of repayment suddenly has a dip in employment duration (i.e. laid off), accurate and timely reassessment of the borrower profile will allow for rapid revocation of any pending lending offers or credit increases, as well as flag the borrower for targeted repayment and or refinancing programs to reduce risk of loan delinquency

plot3 <- fviz_nbclust(heirarch.loans, FUN = hcut, method = "wss")
plot4 <- fviz_nbclust(heirarch.loans, FUN = hcut, method = "silhouette")
gridExtra::grid.arrange(plot3, plot4)

3.2 Final Cluster analysis

Optimal clustering was indicated as 3 clusters via the silhouette method and was selected as the pre-defined number of clusters for hierarchical agglomerate cluster analysis. Unlike the clustering as seen in the initial k-means cluster analysis, by utilizing hierarchical agglomerate clustering, the clusters are created before the best fit number of clusters is selected. This unsupervised algorithm allows for the clustering to be determined directly from the data itself rather than being pre-assigned based on assumptions.

This hierarchical clustering is visually represented in the below dendrogram. The clusters as selected are highlighted visually with different color square overlays. Inside the bounds of each box are the borrowers falling into that cluster. A scatterplot is also included to demonstrate the mapping of these clusters as assigned by the hclust() function.

hcluster2 <- hcluster1
hCluster_group <- cutree(hcluster2, k = 3)
heirarch.loans$hCluster_group = hCluster_group

plot(hcluster2, cex = 0.6, labels = FALSE, hang = -1, xlab = "", main = "Dendogram of Borrower Profile Clustering")
par(lwd = 3)
rect.hclust(hcluster1, k = 3, border = 2:9)

par(lwd=1)
clusplot(heirarch.loans,
         hCluster_group,
         lines = 0,
         shade = T,
         color = T,
         labels = 1,
         plotchar = F,
         span = T,
         main = "Cluster Plot of Raw Borrower Data")

4 Principal Component Analysis (PCA) & Reducing dimensions

To reduce, the dimensions must be scaled. All data transformed via scale() function as log scaling not possible with employment duration as some duration = 0 months. (“Checking_amount”, “Amount”, “Emp_duration”, “Loan_status”)

pca.loans.scale <- heirarch.loans[,-5]

pca.loans.scale$Checking_amount <- scale(pca.loans.scale$Checking_amount)
pca.loans.scale$Amount <- scale(pca.loans.scale$Amount)
pca.loans.scale$Emp_duration <- scale(pca.loans.scale$Emp_duration)
pca.loans.scale$Loan_status <- scale(as.numeric(pca.loans.scale$Loan_status))


pca.loans <- prcomp(pca.loans.scale, center = TRUE, scale = TRUE)


kable(round(pca.loans$rotation, 2), caption = "Factor loading of Borrower Profile PCA") %>% 
  kable_styling()
Factor loading of Borrower Profile PCA
PC1 PC2 PC3 PC4
Checking_amount -0.65 -0.01 0.31 -0.69
Amount 0.30 0.61 0.73 0.04
Emp_duration -0.21 0.79 -0.57 -0.07
Loan_status 0.67 -0.03 -0.20 -0.72
kable(round(summary(pca.loans)$importance, 3), caption = "Importance of each componant of Borrower Profile PCA") %>% 
  kable_styling()
Importance of each componant of Borrower Profile PCA
PC1 PC2 PC3 PC4
Standard deviation 1.243 1.009 0.949 0.733
Proportion of Variance 0.386 0.254 0.225 0.134
Cumulative Proportion 0.386 0.641 0.866 1.000

As demonstrated by the above PCA tables, the first three principal components account for 86.6% of the variation in the borrower profile data. The equation for each principal component is included below:

\[PC_1= -0.65[Checking.amount] + 0.30[Loan.amount] - 0.21[Employment.duration] + 0.67[Loan.status]\] \[PC_2= -0.01[Checking.amount] + 0.61[Loan.amount] + 0.79[Employment.duration] - 0.03[Loan.status]\] \[PC_3= 0.31[Checking.amount] + 0.73[Loan.amount] - 0.57[Employment.duration] - 0.20[Loan.status]\]

PC1 most impacted by checking amount and loan status with light impact from loan amount and employment duration, we will call this measure Immediate Financial Status Index PC2 most impacted by loan amount and employment duration with near 0 impact from checking amount and loan status, we will call this measure Repayment Security Index

PC3 will be dropped ass PC2 and PC3 have same largest influence variables (loan amount and employment duration), but PC2 has a greater proportion of variance than PC3.

4.1 Determne Best Number of Clusters

The same process was utilized to determine the best fit number of clusters for the PCA variables, again outlining 3 clusters as the best fit. Despite this best fit silhouette prediction, four clusters were utilized to reduce overall cluster overlap.

pca.cluster <- data.frame(Imed_fin_stat_PC1 = pca.loans$x[,1],
                          Repay_sec_PC2 = pca.loans$x[,2])

plot5 <- fviz_nbclust(pca.cluster, FUN = hcut, method = "wss")
plot6 <- fviz_nbclust(pca.cluster, FUN = hcut, method = "silhouette")
gridExtra::grid.arrange(plot5, plot6)

pca.distance <- dist(pca.cluster, method = "euclidean")
pca.clust <- hclust(pca.distance, method = "complete")
pca.cluster.group <- cutree(pca.clust, 4)

heirarch.loans$pcaCluster = as.character(pca.cluster.group)

clusplot(pca.cluster,
         pca.cluster.group,
         lines = 0,
         shade = T,
         color = T,
         labels = 1,
         xlab = "Immediate Financial Status Index",
         ylab = "Repayment Security Index",
         plotchar = F,
         span = T,
         main="Cluster Plot of PCA Components")

In reviewing the principal component variables are found to improve borrower profile predictive capabilities through dimension reduction. By combining four variables into two component analysis variables, the data can be more easily manipulated for borrower outcome predictions at the time of lending and increased targeted marketing for customers indicated to have a strong financial status index and repayment security index. This allows for borrower segmentation into four clusters:

  • High immediate financial status & High Repayment Security
  • High Immediate Financial Status & Low Repayment Security
  • Low Immediate Financial Status & High Repayment Security
  • Low Immediate Financial Status & Low Repayment Security

These four clusters that comprise the borrower profile can allow for a targeted customer based approach. Those with high financial status and high repayment security can be targeted for low interest rate loans with a higher down payment marketing to entice further borrowing from a low risk borrower. Where as those with low immediate financial status & high repayment security can be targeted for micro to medium sized loans with minimal to no down payments to entice further borrowing from a low risk borrower that may not have the funds for a traditional high down payment loan.

5 Outlier analysis

Outlier identification can provide further support in both effective targeted marketing, appropriate borrower application denials, and fraudulent borrower application identification. Outliers to the borrower profile groups should be manually reviewed to determine if the borrower is at risk for adjusting one of their principal component classifications (ex: moving from high immediate financial status to low, etc), if the borrower data is potentially fraudulent, or if the borrower data indicates a borrower that requires potential account termination and/or future borrower application denial for lack of repayment ability.

Additionally, properly identifying outliers allows for removal of outliers prior to PCA or hierarchical agglomerate clustering will result in increased borrower repayment profile classification and improved predictive capacity.

5.1 Local Outlier Factor (LOF) Score

A local outlier factor (LOF) score is calculated utilizing the below equation to compare a data point’s local reachability density (LRD) of the nearest k neighbors to point \(A_i\) for \(i = 1 , 2, ... , n\).

\[ LOF(A_i) = \frac{\frac{\sum_{i = A_j \in N_kA_i} {LRD_k(A_j)}}{||N_k(A_i)||}} {LRD_k(A_i)} \]

The LOF factor acts as an easily filterable scale variable to quickly identify outlier values. An LOF > 1 indicates a potential outlier, with the greater values for LOF indiciating more extreme outliers.

lof.pca.cluster <- lof(pca.cluster, minPts = 50)
pca.cluster$LOF <- lof.pca.cluster

pander::pander(summary(lof.pca.cluster), caption = "Summary statistics of LOF scores for Immediate Financial Status Index & Repayment Security Index")
Summary statistics of LOF scores for Immediate Financial Status Index & Repayment Security Index
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9571 0.9939 1.017 1.064 1.079 2.315

Based on the summary statistics for the calculated LOF scores for the PCA feature variables the outlier cutoff was selected as an LOF value of 1.8. This value was selected as the cutoff to ensure less than 1% of the dataset is identified as an outlier, leading to an outlier flagging of 1 in every 100 borrowers. Selecting an LOF value of 1.7 resulted in > 1% outlier flagging. This 1% outlier flagging will scale well to the lender or credit agency’s operational constraints until the need for hyperparameter (k) tuning is required.

plot(x = pca.cluster$Imed_fin_stat_PC1,
     y = pca.cluster$Repay_sec_PC2,
     pch = "x",
     cex = 0.5,
     xlab = "Immediate Financial Status Index",
     ylab = "Repayment Security Index",
     main="Outlier Identification Amongst PCA Components")
points(pca.cluster,
       cex = ((lof.pca.cluster - 1)*1.5),
       pch = 21,
       col = "hotpink")
text(pca.cluster[lof.pca.cluster > 1.8,],
     labels = round(lof.pca.cluster, 1)[lof.pca.cluster >1.8],
     pos = 2,
     cex = 0.6,
     col = "darkred")

kable(filter(pca.cluster, LOF > 1.8), caption = "listing of all LOF scores > 1.8") %>% 
  kable_styling()
listing of all LOF scores > 1.8
Imed_fin_stat_PC1 Repay_sec_PC2 LOF
-2.481105 -2.2764407 2.009484
3.396802 0.6592382 1.882003
-3.034711 -0.0338831 1.888966
-2.282020 -2.8414007 2.315474
2.926977 2.3990462 1.973623

6 Conclusions

Overall proper borrower profile segmentation will allow for improved loan default prediction models, improved identification of fraudulent pre-approval applications, and improved targeted marketing to drive up borrowing rates from borrowers with a high repayment profile. Overall this borrower segmentation and classification may also be used for predictive analysis regarding borrower pre-approval determinations. Overall borrower population segmentation remains a highly effective tool for managing and predicting overall loan outcomes.

7 References

Data source:

Applied Analytics through Case Studies Using SAS and R, Deepti Gupta by APress, ISBN - 978-1-4842-3525-6 Accessed via: https://pengdsci.github.io/datasets/LoanData2/BankLoanDefaultDataset.csv

---
title: "Can you pay me back? :<img src=\"https://nlepera.github.io/sta551/HW01/img/penguin_cute.png\" style=\"float: right; width: 12%\"/>"
subtitle: "Loan Default Borrower Data Clustering & Analysis"
author:
- name: Natalie LePera
  affiliation: West Chester University | STA551 - HW 04
date: "12 Dec 2024"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    toc_collapse: yes
    number_sections: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: true
    theme: readable
    fig_align: center
    df_print: kable
---

```{css, echo = FALSE}
h1.title {  /* Title - font specifications of the report title */
  font-weight:bold;
  color: darkmagenta ;
}
h1.subtitle {  /* Title - font specifications of the report title */
  font-weight:bold;
  color: darkmagenta ;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-family: system-ui;
  color: navy;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-family: system-ui;
  color: navy;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-weight:bold;
    color: navy;
    text-align: left;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-weight:bold;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-weight:bold;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    color: darkred;
    text-align: left;
}

body {
  background-color:white;
}

.highlightme { 
  background-color:yellow; 
}

p { 
  background-color:white; 
}

h5 {
  color: navy;
}

.iframe {
  text-align: center;
}

a:link {
  color: darkmagenta;
}

.figlabel {
  text-align: center;
  color: darkslategray;
  font-weight: bold;
  font-size: 18;
}

.td1 {
  font-weight: bold;
}

th, td {
  border-bottom: 1px solid #ddd;
  text-align: left;
}

tr:hover {background-color: coral;}
```

```{r setup, include=FALSE}
if (!require("dplyr")) {
    install.packages("dplyr")              
    library("dplyr")
}

if (!require("plyr")) {
    install.packages("plyr")              
    library("plyr")
}

if (!require("stringr")) {
    install.packages("stringr")              
    library("stringr")
}

if (!require("plotly")) {
    install.packages("plotly")              
    library("plotly")
}

if (!require("pandoc")) {
    install.packages("pandoc")              
    library("pandoc")
}

if (!require("gridExtra")) {
    install.packages("gridExtra")              
    library("gridExtra")
}

if (!require("grid")) {
    install.packages("grid")              
    library("grid")
}
if (!require("raster")) {
    install.packages("raster")              
    library("raster")
}
if (!require("dbscan")) {
    install.packages("dbscan")              
    library("dbscan")
}
if (!require("pROC")) {
    install.packages("pROC")              
    library("pROC")
}
if (!require("ggridges")) {
    install.packages("ggridges")              
    library("ggridges")
}
if (!require("knitr")) {
    install.packages("knitr")              
    library("knitr")
}
if (!require("GGally")) {
    install.packages("GGally")              
    library("GGally")
}
if (!require("ggplot2")) {
    install.packages("ggplot2")              
    library("gglpot2")
}
if (!require("cluster")) {
    install.packages("cluster")              
    library("cluster")
}
if (!require("kableExtra")) {
    install.packages("kableExtra", dependencies = TRUE)              
    library("kableExtra")
}
if (!require("forcats")) {
    install.packages("forcats", dependencies = TRUE)              
    library("forcats")
}
if (!require("rpart")) {
    install.packages("rpart", dependencies = TRUE)              
    library("rpart")
}
if (!require("rpart.plot")) {
    install.packages("rpart.plot", dependencies = TRUE)              
    library("rpart.plot")
}
if (!require("metan")) {
    install.packages("metan", dependencies = TRUE)              
    library("metan")
}
 if (!require("factoextra")) {
   install.packages("factoextra", dependencies = TRUE)
   library("factoextra")
 }

knitr::opts_chunk$set(echo = TRUE,       
                      warning = FALSE,   
                      result = TRUE,   
                      message = FALSE,
                      comment = NA,
                      fig.align = 'center')

options(DT.options = list(pageLength = 5, scrollX = TRUE))
```


# Introduction

In the current world of high power data analytics and breadth of borrower data available for utilization, ensuring efficient borrower population segmentation and classification is central to the operating processes of any effective lending institution. Financial institutions and credit agencies must prioritize creation of accurate and detailed borrower profiles to accurately predict borrower outcome to ensure timely and complete repayment of loans as well as prevent fraudulent borrowing that will result in a complete financial loss for the institution.   


Through detailed customer segmentation lending institutions and credit agencies can streamline and improve their repayment rates and overall profit by:

  - Identifying borrower populations with greater repayment probability for targeted reduced APR loan marketing
  - Improving prediction accuracy to identify which customers are likely to go into default and provide preemptive mitigation strategies such as refinancing or adjusted repayment plans
  - Develop better live spending limit adjustments to support borrower needs and repayment capabilities
  - Improve fraud detection to reduce the number of fraudulent application approvals

Loan default data was obtained from <a href="https://pengdsci.github.io/datasets/LoanData2/BankLoanDefaultDataset.csv">Applied Analytics through Case Studies Using SAS and R, Deepti Gupta</a> for analysis. Each of the below outlined variables as included in the data set serves as a categorical data point captured to help classify borrowers and predict their repayment ability. 

<h5>Variables</h5>

<table style="width:100%">
<thead><tr>
<th>Variable Name</th>
<th>Variable Type</th>
<th>Details</th>
</tr></thead>
<tr><td class = "td1">Loan Status</td><td>Categorical</td><td>Status of bank loan default (Default vs Current)</td></tr>
<tr><td class = "td1">Checking Amount</td><td>Numeric</td><td>Amount in borrower's checking account</td>
<tr><td class = "td1">Term</td><td>Numeric</td><td>Loan term in months</td></tr>
<tr><td class = "td1">Credit Score</td><td>Numeric</td><td>Borrower's credit score</td></tr>
<tr><td class = "td1">Gender</td><td>Categorical</td><td>Borrower's gender</td></tr>
<tr><td class = "td1">Marital Status</td><td>Categorical</td><td>Borrower's marital status</td></tr>
<tr><td class = "td1">Employment Status</td><td>Categorical</td><td>Borrower's employment status</td></tr>
<tr><td class = "td1">Amount</td><td>Numeric</td><td>Loan amount</td></tr>
<tr><td class = "td1">Saving Amount</td><td>Numeric</td><td>Ammount in borrower's saving account</td></tr>
<tr><td class = "td1">Age</td><td>Numeric</td><td>Duration of borrower's employment in months</td></tr>
<tr><td class = "td1">Number of Credit Accounts</td><td>Numeric</td><td>Number of credit accounts in borrower's name</td></tr>
<tr><td class = "td1">Car Loan</td><td>Categorical</td><td>If borrower holds a car loan</td></tr>
<tr><td class = "td1">Personal Loan</td><td>Categorical</td><td>If borrower holds a personal loan</td></tr>
<tr><td class = "td1">Home Loan</td><td>Categorical</td><td>If borrower holds a home loan</td></tr>
<tr><td class = "td1">Education Loan</td><td>Categorical</td><td>If borrower holds an education loan</td></tr>
<tr><td class = "td1">Any Loan</td><td>Categorical</td><td>A feature variable measuring how many of the 4 defined loans held by the borrower (personal, home, education, or car)</td></tr>
<tr><td class = "td1">Total Debt</td><td>Numeric</td><td>A feature variable measuring the total number of borrower's debts (4 defined loans and Number of Credit Accounts)</td></tr>
</table>

```{r}
raw.loan <- read.csv("https://nlepera.github.io/sta551/HW04/data/BankLoanDefaultDataset.csv")
raw.loan$Any_loan <- rowSums(raw.loan[c(7:10)] == 1)
raw.loan$Total_debt <- as.numeric(raw.loan$Any_loan) + as.numeric(raw.loan$No_of_credit_acc)
raw.loan$Loan_status <- factor(raw.loan$Default, levels = c(0:1), labels = c("Current","Default"))
raw.loan$Car_loan <- factor(raw.loan$Car_loan, levels = c(0:1), labels = c("No", "Yes"))
raw.loan$Personal_loan <- factor(raw.loan$Personal_loan, levels = c(0:1), labels = c("No", "Yes"))
raw.loan$Home_loan <- factor(raw.loan$Home_loan, levels = c(0:1), labels = c("No", "Yes"))
raw.loan$Education_loan <- factor(raw.loan$Education_loan, levels = c(0:1), labels = c("No", "Yes"))

```

## EDA & Feature Variable Generation

In order to more efficiently utilize this data for borrower analysis, the number of variables needs to be reduced in a process called dimension reduction.  The dimensions of this data set were reduced by first combining the variables `Car Loan`, `Personal Loan`, `Home Loan`, `Education Loan` and `Number of Credit Accounts` to create a new variable entitled `Total Debt` that captures the total count of loans and credit accounts for each borrower.  This effectively reduces five variables into a single variable, with minimal loss of information. 

```{r}
loan <- raw.loan[-1] %>% 
  relocate(Loan_status, .before = Checking_amount) %>%
  relocate(Amount, .after = Loan_status) %>% 
  relocate(Term, .after = Amount) %>% 
  relocate(Total_debt, .after = Term) %>% 
  relocate(No_of_credit_acc, .before = Car_loan)

kable(head(loan), caption = "A breif glimpse at the borrower data as collected" ) %>% 
  kable_styling(full_width = FALSE) %>% 
  scroll_box(width ="100%")
```


The distribution of the `Total Debt` feature variable is included for review.

```{r, fig.width=9}

par(mfrow = c(1,2))
hist(x = loan$Total_debt, 
     prob = TRUE,
     main = "Distribution of 
Borrower Total Debt",
     xlab = "Total Number of Debt Accounts")

plot(x = loan$Loan_status,
     y = loan$Total_debt,
     col = c("skyblue", "darkred"),
     main = "Total Debt by Loan Status
(Current vs Default)",
     xlab = "Borrower Loan Default Status
(Current vs Default)",
     ylab = "Total Number of Borrower Debt Accounts")
```




# K Means Cluster Analysis (Full Data)

A rudimentary cluster analysis of the numeric variables (Loan status, Amount, Term, Checking amount, Credit score, Saving amount, Employment Duration, Age, Total debt) provides an initial glimpse at the borrower profile data, and underscores the need for further dimension reduction prior to cluster analysis.  

When assessing fo the optional cluster number utilizing the silhouette method, it demonstrated that the optimal number of clusters is two.  Such a small number of clusters indicates a poor cluster model fit and lacks predictive power.  As demonstrated by the significant overlap in the two clusters, a high level cluster analysis of all numeric borrower profile data does not provide proper borrower segmentation for identification and loan default prediction.

Utilizing such a poorly fit model for various business purposes could result in ineffective targeted marketing, poor pre-approval models resulting in higher back end costs reviewing and rejecting poorly fit applicants, and overall poor risk analysis in initial borrower requests. 


```{r, warning = FALSE, fig.width=9, fig.height=9}
loan.cluster <- loan[, c(1:6, 15:17)]
loan.cluster$Loan_status <- as.numeric(loan.cluster$Loan_status)
loan.cluster.group <- kmeans(x = loan.cluster, centers = 2)
loan.clustID <- loan.cluster.group$cluster


plot1 <- fviz_nbclust(loan.cluster, FUN = hcut, method = "wss")
plot2 <- fviz_nbclust(loan.cluster, FUN = hcut, method = "silhouette")


gridExtra::grid.arrange(plot1,plot2)

clusplot(loan.cluster,
         loan.clustID,
         lines = 0,
         shade = T,
         color = T,
         labels = 1,
         plotchar = F,
         span = T)
```



# Heirarchal Data Clustering - Agglomerative

The following variables were analyzed to determine the borrower's repayment profile:

  - Amount in borrower's saving account `Saving_amount`
  - Borrower's credit score `Credit_score`
  - Loan amount `Amount`
  - Status of bank loan default (Default vs Current) `Loan_status`
  

Utilizing a 'bottom-up' approach the data was split into clusters starting with single data points and successively merging clusters. This allows for tuning of the cluster sizes without pre-determining cluster count.  Therefore, overall borrower profiles can be created from the data without pre-determining segments allowing in improved borrower profiling and prediction.


All data was run both raw and scaled.  Raw data did not present significant skew, and scaled data presented over fitting issues resulting in a best fit of two clusters. No missing values were identified in the data set, therefore no imputation was required. 

```{r}
heirarch.loans <- loan[, c("Checking_amount", "Amount", "Emp_duration", "Loan_status")]
heirarch.distance <- dist(heirarch.loans, method = "euclidean")
hcluster1 <- hclust(heirarch.distance, method = "complete")
```

## Determining Cluster Count

By examining the current borrower data to determine both the number and dimensions of the clusters, the borrower profile clusters can be created to better align with the current borrower profile base.  This will allow for improved accuracy in borrower outcome prediction over the life cycle of the loan.  For example, as a loan progresses, if a borrower originally profiled as a high likelihood of repayment suddenly has a dip in employment duration (i.e. laid off), accurate and timely reassessment of the borrower profile will allow for rapid revocation of any pending lending offers or credit increases, as well as flag the borrower for targeted repayment and or refinancing programs to reduce risk of loan delinquency 

```{r, fig.height=9, fig.width=7}
plot3 <- fviz_nbclust(heirarch.loans, FUN = hcut, method = "wss")
plot4 <- fviz_nbclust(heirarch.loans, FUN = hcut, method = "silhouette")
gridExtra::grid.arrange(plot3, plot4)
```



## Final Cluster analysis

Optimal clustering was indicated as 3 clusters via the silhouette method and was selected as the pre-defined number of clusters for hierarchical agglomerate cluster analysis.  Unlike the clustering as seen in the initial k-means cluster analysis, by utilizing hierarchical agglomerate clustering, the clusters are created before the best fit number of clusters is selected.  This unsupervised algorithm allows for the clustering to be determined directly from the data itself rather than being pre-assigned based on assumptions.

This hierarchical clustering is visually represented in the below dendrogram.  The clusters as selected are highlighted visually with different color square overlays. Inside the bounds of each box are the borrowers falling into that cluster.  A scatterplot is also included to demonstrate the mapping of these clusters as assigned by the hclust() function. 

```{r, fig.width=9}

hcluster2 <- hcluster1
hCluster_group <- cutree(hcluster2, k = 3)
heirarch.loans$hCluster_group = hCluster_group

plot(hcluster2, cex = 0.6, labels = FALSE, hang = -1, xlab = "", main = "Dendogram of Borrower Profile Clustering")
par(lwd = 3)
rect.hclust(hcluster1, k = 3, border = 2:9)

par(lwd=1)
clusplot(heirarch.loans,
         hCluster_group,
         lines = 0,
         shade = T,
         color = T,
         labels = 1,
         plotchar = F,
         span = T,
         main = "Cluster Plot of Raw Borrower Data")


```




# Principal Component Analysis (PCA) & Reducing dimensions

To reduce, the dimensions must be scaled. All data transformed via `scale()` function as log scaling not possible with employment duration as some duration = 0 months. ("Checking_amount", "Amount", "Emp_duration", "Loan_status")

```{r}
pca.loans.scale <- heirarch.loans[,-5]

pca.loans.scale$Checking_amount <- scale(pca.loans.scale$Checking_amount)
pca.loans.scale$Amount <- scale(pca.loans.scale$Amount)
pca.loans.scale$Emp_duration <- scale(pca.loans.scale$Emp_duration)
pca.loans.scale$Loan_status <- scale(as.numeric(pca.loans.scale$Loan_status))


pca.loans <- prcomp(pca.loans.scale, center = TRUE, scale = TRUE)


kable(round(pca.loans$rotation, 2), caption = "Factor loading of Borrower Profile PCA") %>% 
  kable_styling()
kable(round(summary(pca.loans)$importance, 3), caption = "Importance of each componant of Borrower Profile PCA") %>% 
  kable_styling()
```


As demonstrated by the above PCA tables, the first three principal components account for 86.6% of the variation in the borrower profile data.  The equation for each principal component is included below: 

$$PC_1= -0.65[Checking.amount] + 0.30[Loan.amount] - 0.21[Employment.duration] + 0.67[Loan.status]$$
$$PC_2= -0.01[Checking.amount] + 0.61[Loan.amount] + 0.79[Employment.duration] - 0.03[Loan.status]$$
$$PC_3= 0.31[Checking.amount] + 0.73[Loan.amount] - 0.57[Employment.duration] - 0.20[Loan.status]$$

PC1 most impacted by checking amount and loan status with light impact from loan amount and employment duration, we will call this measure Immediate Financial Status Index
PC2 most impacted by loan amount and employment duration with near 0 impact from checking amount and loan status, we will call this measure Repayment Security Index 

PC3 will be dropped ass PC2 and PC3 have same largest influence variables (loan amount and employment duration), but PC2 has a greater proportion of variance than PC3. 

## Determne Best Number of Clusters

The same process was utilized to determine the best fit number of clusters for the PCA variables, again outlining 3 clusters as the best fit.  Despite this best fit silhouette prediction, four clusters were utilized to reduce overall cluster overlap. 

```{r}

pca.cluster <- data.frame(Imed_fin_stat_PC1 = pca.loans$x[,1],
                          Repay_sec_PC2 = pca.loans$x[,2])

plot5 <- fviz_nbclust(pca.cluster, FUN = hcut, method = "wss")
plot6 <- fviz_nbclust(pca.cluster, FUN = hcut, method = "silhouette")
gridExtra::grid.arrange(plot5, plot6)

```

```{r}


pca.distance <- dist(pca.cluster, method = "euclidean")
pca.clust <- hclust(pca.distance, method = "complete")
pca.cluster.group <- cutree(pca.clust, 4)

heirarch.loans$pcaCluster = as.character(pca.cluster.group)

clusplot(pca.cluster,
         pca.cluster.group,
         lines = 0,
         shade = T,
         color = T,
         labels = 1,
         xlab = "Immediate Financial Status Index",
         ylab = "Repayment Security Index",
         plotchar = F,
         span = T,
         main="Cluster Plot of PCA Components")

```

In reviewing the principal component variables are found to improve borrower profile predictive capabilities through dimension reduction.  By combining four variables into two component analysis variables, the data can be more easily manipulated for borrower outcome predictions at the time of lending and increased targeted marketing for customers indicated to have a strong financial status index and repayment security index.   This allows for borrower segmentation into four clusters:

  - High immediate financial status & High Repayment Security
  - High Immediate Financial Status & Low Repayment Security
  - Low Immediate Financial Status & High Repayment Security
  - Low Immediate Financial Status & Low Repayment Security
  

These four clusters that comprise the borrower profile can allow for a targeted customer based approach.  Those with high financial status and high repayment security can be targeted for low interest rate loans with a higher down payment marketing to entice further borrowing from a low risk borrower.  Where as those with low immediate financial status & high repayment security can be targeted for micro to medium sized loans with minimal to no down payments to entice further borrowing from a low risk borrower that may not have the funds for a traditional high down payment loan. 

# Outlier analysis

Outlier identification can provide further support in both effective targeted marketing, appropriate borrower application denials, and fraudulent borrower application identification. Outliers to the borrower profile groups should be manually reviewed to determine if the borrower is at risk for adjusting one of their principal component classifications (ex: moving from high immediate financial status to low, etc), if the borrower data is potentially fraudulent, or if the borrower data indicates a borrower that requires potential account termination and/or future borrower application denial for lack of repayment ability.  

Additionally, properly identifying outliers allows for removal of outliers prior to PCA or hierarchical agglomerate clustering will result in increased borrower repayment profile classification and improved predictive capacity.

## Local Outlier Factor (LOF) Score

A local outlier factor (LOF) score is calculated utilizing the below equation to compare a data point's local reachability density (LRD) of the nearest k neighbors to point $A_i$ for $i = 1 , 2, ... , n$. 

$$
LOF(A_i) = \frac{\frac{\sum_{i = A_j \in N_kA_i} {LRD_k(A_j)}}{||N_k(A_i)||}} {LRD_k(A_i)}
$$

The LOF factor acts as an easily filterable scale variable to quickly identify outlier values.  An LOF > 1 indicates a potential outlier, with the greater values for LOF indiciating more extreme outliers. 

```{r}

lof.pca.cluster <- lof(pca.cluster, minPts = 50)
pca.cluster$LOF <- lof.pca.cluster

pander::pander(summary(lof.pca.cluster), caption = "Summary statistics of LOF scores for Immediate Financial Status Index & Repayment Security Index")
```

Based on the summary statistics for the calculated LOF scores for the PCA feature variables the outlier cutoff was selected as an LOF value of 1.8.  This value was selected as the cutoff to ensure less than 1% of the dataset is identified as an outlier, leading to an outlier flagging of 1 in every 100 borrowers. Selecting an LOF value of 1.7 resulted in > 1% outlier flagging. This 1% outlier flagging will scale well to the lender or credit agency's operational constraints until the need for hyperparameter (k) tuning is required. 

```{r}
plot(x = pca.cluster$Imed_fin_stat_PC1,
     y = pca.cluster$Repay_sec_PC2,
     pch = "x",
     cex = 0.5,
     xlab = "Immediate Financial Status Index",
     ylab = "Repayment Security Index",
     main="Outlier Identification Amongst PCA Components")
points(pca.cluster,
       cex = ((lof.pca.cluster - 1)*1.5),
       pch = 21,
       col = "hotpink")
text(pca.cluster[lof.pca.cluster > 1.8,],
     labels = round(lof.pca.cluster, 1)[lof.pca.cluster >1.8],
     pos = 2,
     cex = 0.6,
     col = "darkred")

kable(filter(pca.cluster, LOF > 1.8), caption = "listing of all LOF scores > 1.8") %>% 
  kable_styling()
```

# Conclusions

Overall proper borrower profile segmentation will allow for improved loan default prediction models, improved identification of fraudulent pre-approval applications, and improved targeted marketing to drive up borrowing rates from borrowers with a high repayment profile.  Overall this borrower segmentation and classification may also be used for predictive analysis regarding borrower pre-approval determinations.  Overall borrower population segmentation remains a highly effective tool for managing and predicting overall loan outcomes.


# References

Data source:

Applied Analytics through Case Studies Using SAS and R, Deepti Gupta by APress, ISBN - 978-1-4842-3525-6
Accessed via: <a href="https://pengdsci.github.io/datasets/LoanData2/BankLoanDefaultDataset.csv">https://pengdsci.github.io/datasets/LoanData2/BankLoanDefaultDataset.csv</a>