class: center, middle, inverse, title-slide .title[ # Will You Pay Me Back for That? ] .subtitle[ ## Natalie LePera ] .author[ ### West Chester University
STA551: Foundations of Data Science ] .institute[ ### Final Project - 12 Dec 2024 ] --- <style type="text/css"> .title-slide { background-image: url(https://nlepera.github.io/sta553/w03_Penguin/img/penguin_cute.png); background-position: bottom right; background-size: 300px; background-color: #70384A; padding-left: 100px; /* delete this for 4:3 aspect ratio */ } h1.title { /* Title - font specifications of the report title */ font-weight:bold; color: #3b082f; } h1.subtitle { /* Title - font specifications of the report title */ font-weight:bold; color: #3b082f; } h4.author { /* Header 4 - font specifications for authors */ font-family: system-ui; color: #3b082f; } h4.date { /* Header 4 - font specifications for the date */ font-family: system-ui; color: #3b082f; } h1 { /* Header 1 - font specifications for level 1 section title */ font-weight:bold; color: #3b082f; text-align: left; } .h1_nospace { /* Header 1 - font specifications for level 1 section title */ font-weight:bold; color: #3b082f; text-align: left; margin-bottom: 1px; } h2 { /* Header 2 - font specifications for level 2 section title */ font-weight:bold; color: #3b082f; text-align: left; } h3 { /* Header 3 - font specifications of level 3 section title */ font-weight:bold; color: #3b082f; text-align: left; margin-bottom: 1px; } .h_nospace { /* Header 3 - font specifications of level 3 section title */ font-weight:bold; color: #3b082f; text-align: left; margin-bottom: 1px; margin-top: 2px; } h4 { /* Header 4 - font specifications of level 4 section title */ font-weight:bold; color: #3b082f; text-align: left; margin-bottom: 1px; } body { background-color:white; } .highlightme { background-color:yellow; } p { background-color:white; } h5 { color: #3b082f; margin-bottom: 1px; } .iframe { text-align: center; } a:link { color: darkmagenta; } .figlabel { text-align: center; color: darkslategray; font-weight: bold; font-size: 18; } table { background-color: white; } .td1 { font-weight: bold; font-size: 14px; } 
td { border-bottom: 1px solid #ddd; text-align: left; font-size: 12px; } th { font-variant: small-caps; border-bottom: 1px solid #ddd; text-align: left; font-size: 17px; } tr:hover {background-color: coral;} .inverse { background-color: #70384A; color: #f0d7eb; } .inverse_light { background-color: #f0d7eb; color: #3b082f; } </style> <h1>Introduction</h1> <h3>Goal:</h3> Create accurate, detailed borrower profiles that better predict borrower outcomes, supporting timely & complete repayment of loans and helping prevent fraudulent borrowing<br> .pull-left[ <h3>Implementation Purposes:</h3> - Targeted reduced-APR loan marketing for prime borrowers - Identification of borrowers with high default risk for preemptive mitigation strategies - ex: refinancing or adjusted repayment plans - Live spending limit adjustments to support borrower needs - Improved fraud detection ] .pull-right[ <h3>Steps:</h3> 1. EDA & Dimension Reduction Through Variable Creation 2. K-means Cluster Analysis 3. Hierarchical Agglomerative Clustering Analysis 4. Principal Component Analysis 5. 
Outlier Detection ] <br> <h3>Data:</h3> Loan default data was obtained from <a href="https://pengdsci.github.io/datasets/LoanData2/BankLoanDefaultDataset.csv">Applied Analytics through Case Studies Using SAS and R, Deepti Gupta</a> --- class:inverse_light <h1 class = "h1_nospace">Variable Types & Details</h1> A high-level summary of the data analyzed <br> <table> <thead><tr> <th>Variable Name</th> <th>Variable Type</th> <th>Details</th> </tr></thead> <tbody> <tr><td class = "td1">Loan Status</td><td>Categorical</td><td>Status of bank loan default (Default vs Current)</td></tr> <tr><td class = "td1">Checking Amount</td><td>Numeric</td><td>Amount in borrower's checking account</td></tr> <tr><td class = "td1">Term</td><td>Numeric</td><td>Loan term in months</td></tr> <tr><td class = "td1">Credit Score</td><td>Numeric</td><td>Borrower's credit score</td></tr> <tr><td class = "td1">Gender</td><td>Categorical</td><td>Borrower's gender</td></tr> <tr><td class = "td1">Marital Status</td><td>Categorical</td><td>Borrower's marital status</td></tr> <tr><td class = "td1">Employment Status</td><td>Categorical</td><td>Borrower's employment status</td></tr> <tr><td class = "td1">Amount</td><td>Numeric</td><td>Loan amount</td></tr> <tr><td class = "td1">Saving Amount</td><td>Numeric</td><td>Amount in borrower's saving account</td></tr> <tr><td class = "td1">Age</td><td>Numeric</td><td>Duration of borrower's employment in months</td></tr> <tr><td class = "td1">Number of Credit Accounts</td><td>Numeric</td><td>Number of credit accounts in borrower's name</td></tr> <tr><td class = "td1">Car Loan</td><td>Categorical</td><td>If borrower holds a car loan</td></tr> <tr><td class = "td1">Personal Loan</td><td>Categorical</td><td>If borrower holds a personal loan</td></tr> <tr><td class = "td1">Home Loan</td><td>Categorical</td><td>If borrower holds a home loan</td></tr> <tr><td class = "td1">Education Loan</td><td>Categorical</td><td>If borrower holds an education loan</td></tr> <tr><td class 
= "td1">Any Loan</td><td>Categorical</td><td>A feature variable measuring how many of the 4 defined loans are held by the borrower (personal, home, education, or car)</td></tr> <tr><td class = "td1">Total Debt</td><td>Numeric</td><td>A feature variable measuring the total number of the borrower's debts (4 defined loans and Number of Credit Accounts)</td></tr> </tbody> </table> --- <h1>EDA & Feature Variable Generation</h1> .pull-left[ Reducing the number of categorical variables by creating a numeric feature variable allows for improved analysis with reduced noise. <br><br> The following variables were combined to create a new variable: <b>Total Debt</b> - Car Loan - Personal Loan - Home Loan - Education Loan - Number of Credit Accounts <br> Total.Debt = Car.loan + Personal.loan + Home.loan + Education.loan + #.of.Credit.Acts ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> ] --- <h1 class = "h1_nospace">k-Means Cluster Analysis (Full Data)</h1> Unsupervised algorithm that partitions the population into a pre-defined number of clusters (2).<br> <b>Good for:</b> Fitting into pre-determined groups (ex: risk ranking)<br> <b>Bad for:</b> Adjusting to real-world data as received<br> .pull-left[ <!-- --> ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> <h4 class = "h_nospace">Results</h4> A best fit of 2 clusters with significant overlap indicates a poor cluster model fit. The component variables created by k-Means clustering are shown to be an insufficient model, accounting for only 43.2% of variability in the data.<br><br> ] --- <h1 class = "h_nospace">Hierarchical Data Clustering - Agglomerative</h1> Unsupervised algorithm creating population segments based on the dataset alone. 
Number of clusters chosen <i>after</i> grouping.<br> <b>Good for:</b> Grouping population based on current real-world evidence<br> <b>Bad for:</b> Handling datasets with extreme outliers or a significant number of outliers<br><br> .pull-left[ <br>The following variables were analyzed to determine the borrower's repayment profile: - Amount in borrower's saving account - Borrower's credit score - Loan amount - Status of bank loan default (Default vs Current) This unsupervised 'bottom-up' approach allows for adaptive grouping of the population based on the current real-world evidence. <b>Main difference:</b> Clusters are created <i>before</i> the optimal number is determined <br> <b>Optimal cluster count:</b> 3 ] .pull-right[ <h4>Determining Cluster Count</h4> <!-- --> ] --- <h1>Cluster Visualization - Dendrogram (Hierarchical Clustered Data)</h1> .pull-left[ <!-- --> ] .pull-right[ <br><br> <h4 class = "h_nospace">Interpreting the Dendrogram</h4> - Each split in the dendrogram (tree) captures a split into 2 clusters - Color square overlays represent each cluster selected - To adjust the number of clusters selected, a different height is used for cluster selection - higher = fewer clusters - lower = more clusters ] --- <h1>Final Cluster Analysis (Hierarchical Clustered Data)</h1> .pull-left[ <!-- --> ] .pull-right[ <br><h4>Results</h4> A best fit of 3 clusters with reduced overlap indicates an improved cluster model fit. The component variables created by hierarchical clustering are shown to be a moderate fit model, accounting for 61.95% of variability in the data.<br><br> While computationally heavy, this model will provide moderate success in predictive classification of borrowers. ] --- <h1>Principal Component Analysis (PCA) & Reducing dimensions</h1> All data were transformed via the scale function, as log scaling is not possible for employment duration because some durations = 0 months. 
Variables selected for principal component creation: Checking amount, Amount, Employment duration, Loan status .pull-left[ <table class="table" style="margin-left: auto; margin-right: auto;"> <caption>Factor loadings of Borrower Profile PCA</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> PC1 </th> <th style="text-align:right;"> PC2 </th> <th style="text-align:right;"> PC3 </th> <th style="text-align:right;"> PC4 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Checking_amount </td> <td style="text-align:right;"> -0.65 </td> <td style="text-align:right;"> -0.01 </td> <td style="text-align:right;"> 0.31 </td> <td style="text-align:right;"> -0.69 </td> </tr> <tr> <td style="text-align:left;"> Amount </td> <td style="text-align:right;"> 0.30 </td> <td style="text-align:right;"> 0.61 </td> <td style="text-align:right;"> 0.73 </td> <td style="text-align:right;"> 0.04 </td> </tr> <tr> <td style="text-align:left;"> Emp_duration </td> <td style="text-align:right;"> -0.21 </td> <td style="text-align:right;"> 0.79 </td> <td style="text-align:right;"> -0.57 </td> <td style="text-align:right;"> -0.07 </td> </tr> <tr> <td style="text-align:left;"> Loan_status </td> <td style="text-align:right;"> 0.67 </td> <td style="text-align:right;"> -0.03 </td> <td style="text-align:right;"> -0.20 </td> <td style="text-align:right;"> -0.72 </td> </tr> </tbody> </table> ] .pull-right[ <table class="table" style="margin-left: auto; margin-right: auto;"> <caption>Importance of each component of Borrower Profile PCA</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> PC1 </th> <th style="text-align:right;"> PC2 </th> <th style="text-align:right;"> PC3 </th> <th style="text-align:right;"> PC4 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Standard deviation </td> <td style="text-align:right;"> 1.243 </td> <td style="text-align:right;"> 1.009 </td> <td style="text-align:right;"> 0.949 
</td> <td style="text-align:right;"> 0.733 </td> </tr> <tr> <td style="text-align:left;"> Proportion of Variance </td> <td style="text-align:right;"> 0.386 </td> <td style="text-align:right;"> 0.254 </td> <td style="text-align:right;"> 0.225 </td> <td style="text-align:right;"> 0.134 </td> </tr> <tr> <td style="text-align:left;"> Cumulative Proportion </td> <td style="text-align:right;"> 0.386 </td> <td style="text-align:right;"> 0.641 </td> <td style="text-align:right;"> 0.866 </td> <td style="text-align:right;"> 1.000 </td> </tr> </tbody> </table> ] <br>As demonstrated by the above PCA tables, the first two principal components account for 64.1% of the variation in the borrower profile data. The equation for each of these two principal components is included below: <br>PC(1) = -0.65[Checking.amount] + 0.30[Loan.amount] - 0.21[Employment.duration] + 0.67[Loan.status] <br>PC(2) = -0.01[Checking.amount] + 0.61[Loan.amount] + 0.79[Employment.duration] - 0.03[Loan.status] --- <h1>Principal Component Analysis (PCA) & Reducing dimensions</h1> `\(PC_1\)` = Immediate Financial Status Index `\(PC_2\)` = Repayment Security Index .pull-left[ <h4>Determine Best Number of Clusters</h4> The same process was utilized to determine the best fit number of clusters for the PCA variables, again indicating 3 clusters as the best fit. Despite this, 4 clusters were utilized to reduce overall cluster overlap. 
<h4>Cluster Naming & Classification</h4> Borrower segmentation into four clusters: - High Immediate Financial Status & High Repayment Security - High Immediate Financial Status & Low Repayment Security - Low Immediate Financial Status & High Repayment Security - Low Immediate Financial Status & Low Repayment Security ] .pull-right[ <!-- --> ] --- <h1>Local Outlier Factor (LOF) Implementation</h1> .pull-left[ <h4>Local Outlier Factor (LOF) Score</h4> A local outlier factor (LOF) score is calculated with the equation below, which compares the local reachability density (LRD) of point `\(A_i\)` with the LRDs of its k nearest neighbors, for `\(i = 1 , 2, ... , n\)`. <img src="https://nlepera.github.io/sta551/Screenshot%202024-12-12%20234238.png" style="width:300px"> ] .pull-right[ <h4>Summary statistics</h4> LOF scores for Immediate Financial Status Index & Repayment Security Index ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.9571 0.9939 1.0170 1.0638 1.0788 2.3155 ``` ] The LOF score acts as an easily filterable scale variable to quickly identify outlier values. 
- LOF > 1 indicates a potential outlier - greater values for LOF indicate more extreme outliers - LOF `\(\le\)` 1 indicates not an outlier <b>LOF Cutoff Selected:</b> 1.8 --- <h1>Outlier Visualization and Listing</h1> .pull-left[ <h3>Outliers Circled and Numbered Below </h3> <!-- --> ] .pull-right[ <h3>All LOF scores > 1.8 - Data Outliers </h3> <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Imed_fin_stat_PC1 </th> <th style="text-align:right;"> Repay_sec_PC2 </th> <th style="text-align:right;"> LOF </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -2.481105 </td> <td style="text-align:right;"> -2.2764407 </td> <td style="text-align:right;"> 2.009484 </td> </tr> <tr> <td style="text-align:right;"> 3.396802 </td> <td style="text-align:right;"> 0.6592382 </td> <td style="text-align:right;"> 1.882003 </td> </tr> <tr> <td style="text-align:right;"> -3.034711 </td> <td style="text-align:right;"> -0.0338831 </td> <td style="text-align:right;"> 1.888966 </td> </tr> <tr> <td style="text-align:right;"> -2.282020 </td> <td style="text-align:right;"> -2.8414007 </td> <td style="text-align:right;"> 2.315474 </td> </tr> <tr> <td style="text-align:right;"> 2.926977 </td> <td style="text-align:right;"> 2.3990462 </td> <td style="text-align:right;"> 1.973623 </td> </tr> </tbody> </table> Selecting an LOF cutoff of 1.7 flagged > 1% of observations as outliers. This 1% outlier flagging cap can be scaled with operational constraints until tuning of the hyperparameter (k) is required. 
] --- <h1>Limitations</h1> Utilizing clustering to predict borrower repayment profiles comes with some limitations to consider: - Hierarchical clustering can require significant computational power - Unsupervised models are more costly to run - k-Means clustering requires manual updating to fine-tune it to real-world data - LOF outlier detection requires hyperparameter tuning for increased accuracy over time --- <h1>Conclusions</h1> Proper borrower profile segmentation will allow for: - Improved loan default prediction models - Improved identification of fraudulent pre-approval applications - Improved targeted marketing to drive up borrowing rates from borrowers with a high repayment profile. This borrower segmentation and classification may also be used for predictive analysis regarding borrower pre-approval determinations. Overall, borrower population segmentation remains a highly effective tool for managing and predicting loan outcomes. --- <h1>References</h1> <h3>Data source:</h3> Deepti Gupta, Applied Analytics through Case Studies Using SAS and R, Apress, ISBN 978-1-4842-3525-6. Accessed via: <a href="https://pengdsci.github.io/datasets/LoanData2/BankLoanDefaultDataset.csv">https://pengdsci.github.io/datasets/LoanData2/BankLoanDefaultDataset.csv</a>
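
---

<h1>Appendix: LOF Cutoff Sketch</h1>

The LOF scoring and cutoff filter described in the outlier slides can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind these slides: it is written in pure Python (the summary output shown earlier suggests the analysis itself was run in R), the function name `lof_scores`, the toy points, and `k = 3` are hypothetical, and each neighborhood is simplified to exactly k neighbors. Only the 1.8 cutoff is taken from the slides.

```python
import math


def lof_scores(points, k):
    """Local Outlier Factor for each point (simplified: exactly k neighbours).

    Assumes len(points) > k and no duplicated points (a duplicate would
    give a zero reachability sum and a division by zero).
    """
    n = len(points)
    # Pairwise Euclidean distances between all points.
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbours, k_distance = [], []
    for i in range(n):
        # Other points ordered by distance from point i.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])  # distance to the k-th neighbour

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical stand-in for the (PC1, PC2) scores: a tight 4 x 4 grid
# plus one isolated point at index 16.
toy_scores = [(x * 0.1, y * 0.1) for x in range(4) for y in range(4)] + [(3.0, 3.0)]

lof = lof_scores(toy_scores, k=3)
cutoff = 1.8  # cutoff selected in the slides
flagged = [i for i, score in enumerate(lof) if score > cutoff]
flagged_share = len(flagged) / len(lof)  # compare against the flagging cap
```

On this toy data only the isolated point exceeds the cutoff, while the grid points score close to 1. Comparing `flagged_share` against the flagging cap mirrors how the cutoff was chosen in the slides.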