Will You Pay Me Back for That?

class: center, middle, inverse, title-slide

.title[
# Will You Pay Me Back for That?
]
.subtitle[
## Natalie LePera
]
.author[
### West Chester University<br>STA551: Foundations of Data Science
]
.institute[
### Final Project - 12 Dec 2024
]

---

<style>
h1.title {  /* Title - font specifications of the report title */
  font-weight:bold;
  color: #3b082f;
}
h1.subtitle {  /* Subtitle - font specifications of the report subtitle */
  font-weight:bold;
  color: #3b082f;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-family: system-ui;
  color: #3b082f;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-family: system-ui;
  color: #3b082f;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-weight:bold;
    color: #3b082f;
    text-align: left;
}
.h1_nospace { /* Header 1 - font specifications for level 1 section title  */
    font-weight:bold;
    color: #3b082f;
    text-align: left;
    margin-bottom: 1px;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-weight:bold;
    color: #3b082f;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-weight:bold;
    color: #3b082f;
    text-align: left;
    margin-bottom: 1px;
}
.h_nospace { /* Header 3 - font specifications of level 3 section title  */
    font-weight:bold;
    color: #3b082f;
    text-align: left;
    margin-bottom: 1px;
    margin-top: 2px;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-weight:bold;
    color: #3b082f;
    text-align: left;
    margin-bottom: 1px;
}

body {
  background-color:white;
}

.highlightme { 
  background-color:yellow; 
}

p { 
  background-color:white; 
}

h5 {
  color: #3b082f;
  margin-bottom: 1px;
}

.iframe {
  text-align: center;
}

a:link {
  color: darkmagenta;
}

.figlabel {
  text-align: center;
  color: darkslategray;
  font-weight: bold;
  font-size: 18px;
}

table {
  background-color: white;
}

.td1 {
  font-weight: bold;
  font-size: 14px;
}

td {
  border-bottom: 1px solid #ddd;
  text-align: left;
  font-size: 12px;
}
th {
  font-variant: small-caps;
  border-bottom: 1px solid #ddd;
  text-align: left;
  font-size: 17px;
}

tr:hover {background-color: coral;}

.inverse {
  background-color: #70384A;
  color:  #f0d7eb;
}

.inverse_light {
  background-color: #f0d7eb;
  color: #3b082f;
}

</style>

<h1>Introduction</h1>

<h3>Goal:</h3> 
Create accurate, detailed borrower profiles to better predict borrower outcomes, supporting timely and complete loan repayment and preventing fraudulent borrowing.<br>

.pull-left[
<h3>Implementation Purposes:</h3>
  - Targeted reduced APR loan marketing for prime borrowers
  - Identification of borrowers with high default risk for preemptive mitigation strategies
      - ex: refinancing or adjusted repayment plans
  - Live spending limit adjustments to support borrower needs
  - Improve fraud detection 
]

.pull-right[
<h3>Steps:</h3>
  1. EDA & Dimension Reduction Through Variable Creation
  2. K-means Cluster Analysis
  3. Hierarchical Agglomerative Clustering Analysis
  4. Principal Component Analysis
  5. Outlier Detection
]
<br>
<h3>Data:</h3> Loan default data was obtained from <a href="https://pengdsci.github.io/datasets/LoanData2/BankLoanDefaultDataset.csv">Applied Analytics through Case Studies Using SAS and R, Deepti Gupta</a>
---
class:inverse_light
<h1 class = "h1_nospace">Variable Types & Details</h1>
A high-level summary of the data analyzed
<br>

<table>
<thead><tr>
<th>Variable Name</th>
<th>Variable Type</th>
<th>Details</th>
</tr></thead>
<tbody>
<tr><td class = "td1">Loan Status</td><td>Categorical</td><td>Status of bank loan default (Default vs Current)</td></tr>
<tr><td class = "td1">Checking Amount</td><td>Numeric</td><td>Amount in borrower's checking account</td></tr>
<tr><td class = "td1">Term</td><td>Numeric</td><td>Loan term in months</td></tr>
<tr><td class = "td1">Credit Score</td><td>Numeric</td><td>Borrower's credit score</td></tr>
<tr><td class = "td1">Gender</td><td>Categorical</td><td>Borrower's gender</td></tr>
<tr><td class = "td1">Marital Status</td><td>Categorical</td><td>Borrower's marital status</td></tr>
<tr><td class = "td1">Employment Status</td><td>Categorical</td><td>Borrower's employment status</td></tr>
<tr><td class = "td1">Amount</td><td>Numeric</td><td>Loan amount</td></tr>
<tr><td class = "td1">Saving Amount</td><td>Numeric</td><td>Amount in borrower's saving account</td></tr>
<tr><td class = "td1">Employment Duration</td><td>Numeric</td><td>Duration of borrower's employment in months</td></tr>
<tr><td class = "td1">Number of Credit Accounts</td><td>Numeric</td><td>Number of credit accounts in borrower's name</td></tr>
<tr><td class = "td1">Car Loan</td><td>Categorical</td><td>If borrower holds a car loan</td></tr>
<tr><td class = "td1">Personal Loan</td><td>Categorical</td><td>If borrower holds a personal loan</td></tr>
<tr><td class = "td1">Home Loan</td><td>Categorical</td><td>If borrower holds a home loan</td></tr>
<tr><td class = "td1">Education Loan</td><td>Categorical</td><td>If borrower holds an education loan</td></tr>
<tr><td class = "td1">Any Loan</td><td>Categorical</td><td>A feature variable measuring how many of the 4 defined loans (personal, home, education, or car) are held by the borrower</td></tr>
<tr><td class = "td1">Total Debt</td><td>Numeric</td><td>A feature variable measuring the total number of the borrower's debts (4 defined loans and Number of Credit Accounts)</td></tr>
</tbody>
</table>

---
<h1>EDA & Feature Variable Generation</h1>

.pull-left[
Reducing the number of categorical variables by creating a numeric feature variable allows for improved analysis with reduced noise.
<br><br>

The following variables were combined to create a new variable: <b>Total Debt</b>
  - Car Loan
  - Personal Loan
  - Home Loan
  - Education Loan
  - Number of Credit Accounts

<br>
Total.Debt = Car.loan + Personal.loan + 
            Home.loan + Education.loan + #.of.Credit.Acts
]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]
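
The Total Debt formula above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the slides: the column names, the Yes/No coding of the loan indicators, and the two borrower records are all assumptions made for the example.

```python
# Hypothetical column names; the Yes/No coding of loan indicators is assumed.
LOAN_COLUMNS = ["car_loan", "personal_loan", "home_loan", "education_loan"]

def as_flag(value):
    """Map a Yes/No (or 1/0) loan indicator to an integer 1/0."""
    return 1 if str(value).strip().lower() in {"yes", "1", "true"} else 0

def total_debt(record):
    """Sum the four loan flags and the number of credit accounts."""
    loans_held = sum(as_flag(record[col]) for col in LOAN_COLUMNS)
    return loans_held + int(record["credit_accounts"])

# Two made-up borrowers, not rows from the loan dataset.
borrowers = [
    {"car_loan": "Yes", "personal_loan": "No", "home_loan": "No",
     "education_loan": "Yes", "credit_accounts": 3},
    {"car_loan": "No", "personal_loan": "No", "home_loan": "No",
     "education_loan": "No", "credit_accounts": 1},
]
debts = [total_debt(b) for b in borrowers]
print(debts)  # [5, 1]
```

Collapsing five columns into one count keeps the debt information while removing four categorical variables from the clustering input.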

---
<h1 class = "h1_nospace">k-Means Cluster Analysis (Full Data)</h1>
Unsupervised algorithm that partitions the population into a pre-defined number of clusters (2).<br>
<b>Good for:</b> Fitting into pre-determined groups (ex: risk ranking)<br>
<b>Bad for:</b> Adjusting to real world data as received<br>

.pull-left[
![](index_files/figure-html/unnamed-chunk-4-1.png)
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
<h4 class = "h_nospace">Results</h4>
A best fit of 2 clusters with significant overlap indicates a poor cluster model fit. The component variables created by k-Means clustering are shown to be an insufficient model, accounting for only 43.2% of variability in the data.<br><br>

]
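
To make the "pre-defined number of clusters" idea concrete, here is a minimal sketch of the k-means loop (Lloyd's algorithm) in plain Python. It is not the implementation behind the plots above. The toy points are made up, and starting from the first k points is a simplification chosen so the result is reproducible.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with the first k points as starting centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy, already-scaled two-variable points; not the project data.
points = [(-1.2, -0.9), (1.0, 1.2), (-1.0, -1.1),
          (1.1, 0.8), (-0.8, -1.0), (0.9, 1.0)]
labels, centers = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Because k is fixed before the data are seen, the algorithm always returns exactly k groups, whether or not the data naturally separate that way.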

---
<h1 class = "h_nospace">Hierarchical Data Clustering - Agglomerative</h1>
Unsupervised algorithm creating population segments based on the dataset alone. Number of clusters chosen <i>after</i> grouping.<br>
<b>Good for:</b> Grouping population based on current real world evidence<br>
<b>Bad for:</b> Handling datasets with extreme outliers or a significant number of outliers<br><br>

.pull-left[
<br>The following variables were analyzed to determine the borrower's repayment profile:

  - Amount in borrower's saving account
  - Borrower's credit score
  - Loan amount
  - Status of bank loan default (Default vs Current)
  
This unsupervised 'bottom-up' approach allows for adaptive grouping of the population based on the current real world evidence.

<b>Main difference:</b> Clusters created <i>before</i> optimal number determined <br>
<b>Optimal cluster count:</b> 3
]
.pull-right[
<h4>Determining Cluster Count</h4>

![](index_files/figure-html/unnamed-chunk-7-1.png)
]
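
The bottom-up grouping described above can be sketched as follows. This is a hedged illustration rather than the code behind the slides: complete linkage and the toy points are assumptions, and the brute-force search is written for readability, not speed.

```python
import math

def agglomerate(points, n_clusters):
    """Merge the two closest clusters (complete linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    heights = []  # merge distances, i.e. the heights drawn in a dendrogram

    def linkage(a, b):
        # Complete linkage: distance between the farthest pair of members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(linkage(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        height, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        heights.append(height)
    return clusters, heights

# Toy points forming three tight pairs; not the project data.
points = [(0.0, 0.0), (0.1, 0.2), (3.0, 3.1),
          (3.2, 2.9), (6.0, 0.1), (6.1, 0.3)]
clusters, heights = agglomerate(points, n_clusters=3)
print(sorted(sorted(c) for c in clusters))  # [[0, 1], [2, 3], [4, 5]]
```

Every point starts as its own cluster, so the full merge history exists before a cluster count is picked. Stopping at a different `n_clusters` is equivalent to cutting the tree at a different height.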

---
<h1>Cluster Visualization - Dendrogram (Hierarchical Clustered Data)</h1>
.pull-left[
![](index_files/figure-html/unnamed-chunk-8-1.png)
]

.pull-right[
<br><br>
<h4 class = "h_nospace">Interpreting the Dendrogram</h4>

- Each split in the dendrogram (tree) represents a division into 2 clusters
- Colored box overlays mark each selected cluster
- To change the number of clusters selected, cut the tree at a different height
    - higher = fewer clusters
    - lower = more clusters
]

---
<h1>Final Cluster Analysis (Hierarchical Clustered Data)</h1>
.pull-left[
![](index_files/figure-html/unnamed-chunk-9-1.png)
]

.pull-right[
<br><h4>Results</h4>
A best fit of 3 clusters with reduced overlap indicates an improved cluster model fit.

The component variables created by hierarchical clustering are shown to be a moderate fit model, accounting for 61.95% of variability in the data.<br><br>

While computationally heavy, this model will provide moderate success in predictive classification of borrowers.
]

---
<h1>Principal Component Analysis (PCA) & Reducing dimensions</h1>

All data were standardized with the scale function; log scaling was not possible for employment duration because some durations equal 0 months.

Variables selected for principal component creation: Checking amount, Amount, Employment duration, Loan status

.pull-left[
<table class="table" style="margin-left: auto; margin-right: auto;">
<caption>Factor loading of Borrower Profile PCA</caption>
 <thead>
  <tr>
   <th style="text-align:left;">  </th>
   <th style="text-align:right;"> PC1 </th>
   <th style="text-align:right;"> PC2 </th>
   <th style="text-align:right;"> PC3 </th>
   <th style="text-align:right;"> PC4 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Checking_amount </td>
   <td style="text-align:right;"> -0.65 </td>
   <td style="text-align:right;"> -0.01 </td>
   <td style="text-align:right;"> 0.31 </td>
   <td style="text-align:right;"> -0.69 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Amount </td>
   <td style="text-align:right;"> 0.30 </td>
   <td style="text-align:right;"> 0.61 </td>
   <td style="text-align:right;"> 0.73 </td>
   <td style="text-align:right;"> 0.04 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Emp_duration </td>
   <td style="text-align:right;"> -0.21 </td>
   <td style="text-align:right;"> 0.79 </td>
   <td style="text-align:right;"> -0.57 </td>
   <td style="text-align:right;"> -0.07 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Loan_status </td>
   <td style="text-align:right;"> 0.67 </td>
   <td style="text-align:right;"> -0.03 </td>
   <td style="text-align:right;"> -0.20 </td>
   <td style="text-align:right;"> -0.72 </td>
  </tr>
</tbody>
</table>
]
.pull-right[
<table class="table" style="margin-left: auto; margin-right: auto;">
<caption>Importance of each component of Borrower Profile PCA</caption>
 <thead>
  <tr>
   <th style="text-align:left;">  </th>
   <th style="text-align:right;"> PC1 </th>
   <th style="text-align:right;"> PC2 </th>
   <th style="text-align:right;"> PC3 </th>
   <th style="text-align:right;"> PC4 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Standard deviation </td>
   <td style="text-align:right;"> 1.243 </td>
   <td style="text-align:right;"> 1.009 </td>
   <td style="text-align:right;"> 0.949 </td>
   <td style="text-align:right;"> 0.733 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Proportion of Variance </td>
   <td style="text-align:right;"> 0.386 </td>
   <td style="text-align:right;"> 0.254 </td>
   <td style="text-align:right;"> 0.225 </td>
   <td style="text-align:right;"> 0.134 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Cumulative Proportion </td>
   <td style="text-align:right;"> 0.386 </td>
   <td style="text-align:right;"> 0.641 </td>
   <td style="text-align:right;"> 0.866 </td>
   <td style="text-align:right;"> 1.000 </td>
  </tr>
</tbody>
</table>
]

<br>As demonstrated by the above PCA tables, the first two principal components account for 64.1% of the variation in the borrower profile data.  The equation for each of these principal components is included below:

<br>PC(1) = -0.65[Checking.amount] + 0.30[Loan.amount] - 0.21[Employment.duration] + 0.67[Loan.status]
<br>PC(2) = -0.01[Checking.amount] + 0.61[Loan.amount] + 0.79[Employment.duration] - 0.03[Loan.status]
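
As a quick check on the tables, the sketch below recomputes the proportion of variance from the reported standard deviations and evaluates the two equations for one borrower. The standard deviations and loadings are copied from the tables above. The borrower's standardized values are hypothetical.

```python
# Values copied from the two PCA tables above.
std_devs = [1.243, 1.009, 0.949, 0.733]
loadings = {
    "PC1": [-0.65, 0.30, -0.21, 0.67],
    "PC2": [-0.01, 0.61, 0.79, -0.03],
}

# Proportion of variance: each squared standard deviation over their total.
variances = [s ** 2 for s in std_devs]
total = sum(variances)
proportions = [v / total for v in variances]
cumulative = [sum(proportions[: i + 1]) for i in range(len(proportions))]
print([round(p, 3) for p in proportions])  # [0.386, 0.254, 0.225, 0.134]
print([round(c, 3) for c in cumulative])   # [0.386, 0.641, 0.866, 1.0]

def pc_score(scaled_values, weights):
    """Weighted sum of standardized variables, in the table's row order."""
    return sum(w * x for w, x in zip(weights, scaled_values))

# Hypothetical standardized borrower:
# checking amount, loan amount, employment duration, loan status.
borrower = [1.0, -0.5, 0.2, -1.0]
scores = {name: pc_score(borrower, w) for name, w in loadings.items()}
print({name: round(s, 3) for name, s in scores.items()})
```

The cumulative value of 0.641 is reached at the second component, which is why two components are carried forward.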

---
<h1>Principal Component Analysis (PCA) & Reducing dimensions</h1>
`\(PC_1\)` = Immediate Financial Status Index<br>
`\(PC_2\)` = Repayment Security Index

.pull-left[
<h4>Determine Best Number of Clusters</h4>
The same process was utilized to determine the best fit number of clusters for the PCA variables, again outlining 3 clusters as the best fit.

Despite this, 4 clusters were utilized to reduce overall cluster overlap.

<h4>Cluster Naming & Classification</h4>
Borrower segmentation into four clusters:
  - High Immediate Financial Status & High Repayment Security
  - High Immediate Financial Status & Low Repayment Security
  - Low Immediate Financial Status & High Repayment Security
  - Low Immediate Financial Status & Low Repayment Security
]

.pull-right[

![](index_files/figure-html/unnamed-chunk-13-1.png)
]

---
<h1>Local Outlier Factor (LOF) Implementation</h1>

.pull-left[
<h4>Local Outlier Factor (LOF) Score</h4>

A local outlier factor (LOF) score compares the local reachability density (LRD) of a data point `\(A_i\)`, for `\(i = 1, 2, \ldots, n\)`, with the LRDs of its k nearest neighbors.
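
For reference, the standard LOF definition can be written as follows. The notation is an assumption, since it does not appear elsewhere in these slides: `\(N_k(A_i)\)` is the set of k nearest neighbors of `\(A_i\)`, `\(d(A_i, B)\)` is the distance between two points, and k-distance(B) is the distance from B to its k-th nearest neighbor.

`$$\mathrm{LOF}_k(A_i) = \frac{1}{|N_k(A_i)|} \sum_{B \in N_k(A_i)} \frac{\mathrm{LRD}_k(B)}{\mathrm{LRD}_k(A_i)}$$`

`$$\mathrm{LRD}_k(A_i) = \left( \frac{1}{|N_k(A_i)|} \sum_{B \in N_k(A_i)} \max\{k\text{-distance}(B),\ d(A_i, B)\} \right)^{-1}$$`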

]

.pull-right[

<h4>Summary statistics</h4>

LOF scores for Immediate Financial Status Index & Repayment Security Index

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9571  0.9939  1.0170  1.0638  1.0788  2.3155
```
]

The LOF score acts as an easily filterable scale variable for quickly identifying outlier values.
  - LOF > 1 indicates a potential outlier
    - greater LOF values indicate more extreme outliers
  - LOF `\(\le\)` 1 indicates the point is not an outlier

<b>LOF Cutoff Selected:</b> 1.8
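
A brute-force sketch of LOF scoring and cutoff filtering is shown below. It is not the routine used to produce the scores on this slide. The 1.8 cutoff comes from the slide. The toy grid, the value k = 3, and the simplification of using exactly k neighbors (ignoring ties, and assuming no duplicate points) are assumptions.

```python
import math

def lof_scores(points, k):
    """Return the local outlier factor of every point (brute force, O(n^2))."""
    n = len(points)
    # k nearest neighbors of each point as (distance, index), nearest first.
    neighbors = []
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        neighbors.append(dists[:k])
    # Distance from each point to its k-th nearest neighbor.
    k_distance = [nbrs[-1][0] for nbrs in neighbors]

    def lrd(i):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    densities = [lrd(i) for i in range(n)]
    # LOF: average neighbor density relative to the point's own density.
    return [sum(densities[j] for _, j in neighbors[i]) / (k * densities[i])
            for i in range(n)]

# Toy data: a 3x3 grid of points plus one distant point (index 9).
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(points, k=3)

CUTOFF = 1.8  # cutoff selected on this slide
flagged = [i for i, s in enumerate(scores) if s > CUTOFF]
print(flagged)  # [9]
```

Points inside the grid score close to 1 because their density matches that of their neighbors. The distant point scores far above the cutoff because its neighbors are much denser than it is.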

---
<h1>Outlier Visualization and Listing</h1>

.pull-left[
<h3>Outliers Circled and Numbered Below </h3>
![](index_files/figure-html/unnamed-chunk-15-1.png)
]
.pull-right[
<h3>All LOF scores > 1.8 - Data Outliers </h3>
<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> Imed_fin_stat_PC1 </th>
   <th style="text-align:right;"> Repay_sec_PC2 </th>
   <th style="text-align:right;"> LOF </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> -2.481105 </td>
   <td style="text-align:right;"> -2.2764407 </td>
   <td style="text-align:right;"> 2.009484 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3.396802 </td>
   <td style="text-align:right;"> 0.6592382 </td>
   <td style="text-align:right;"> 1.882003 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> -3.034711 </td>
   <td style="text-align:right;"> -0.0338831 </td>
   <td style="text-align:right;"> 1.888966 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> -2.282020 </td>
   <td style="text-align:right;"> -2.8414007 </td>
   <td style="text-align:right;"> 2.315474 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2.926977 </td>
   <td style="text-align:right;"> 2.3990462 </td>
   <td style="text-align:right;"> 1.973623 </td>
  </tr>
</tbody>
</table>

Selecting an LOF cutoff of 1.7 flagged more than 1% of observations as outliers, so the cutoff of 1.8 was used instead. This 1% flagging cap can scale with operational constraints until tuning of the hyperparameter (k) is required.
]

---
<h1>Limitations</h1>

Utilizing clustering to predict borrower repayment profiles comes with some limitations to consider:
  - Hierarchical clustering can require significant computational power
    - Pairwise distances must be computed across all borrowers, making it more costly to run than k-means
  - k-Means clustering requires manual updating for fine tuning to real world data
  - LOF outlier detection requires hyperparameter tuning for increased accuracy over time

---
<h1>Conclusions</h1>

Proper borrower profile segmentation will allow for:
  - Improved loan default prediction models
  - Improved identification of fraudulent pre-approval applications
  - Improved targeted marketing to drive up borrowing rates among borrowers with a high repayment profile

This borrower segmentation and classification may also be used for predictive analysis of borrower pre-approval determinations.  Borrower population segmentation remains a highly effective tool for managing and predicting loan outcomes.

---
<h1>References</h1>

<h3>Data source:</h3>

Deepti Gupta, Applied Analytics through Case Studies Using SAS and R, Apress, ISBN 978-1-4842-3525-6.<br>
Accessed via: <a href="https://pengdsci.github.io/datasets/LoanData2/BankLoanDefaultDataset.csv">https://pengdsci.github.io/datasets/LoanData2/BankLoanDefaultDataset.csv</a>