class: center, middle, inverse, title-slide

.title[
# U.S. Bank Loans
]
.subtitle[
## Sampling Strategies
]
.author[
### Tyler Battaglini & Ryan Lebo
]
.date[
### 2025-03-30
]

---

## Table of Contents

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Introduction</li>
<li>Variables</li>
<li>Missing Values/Converting Variables</li>
<li>Variable Rework</li>
<li>Default Rates/Discretize GrAppv</li>
<li>Study Population/Sample Calculations</li>
<li>Sampling Methods</li>
<li>Comparison of Methods</li>
<li>Conclusion</li>
</ul>

---

## Introduction

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Exploratory data analysis (EDA) of a bank loan dataset</li>
<li>Combined dataset</li>
<li>Data come from the U.S. Small Business Administration (SBA)</li>
</ul>

---

## Variables

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Mis_Status</li>
<li>DisbursementGross</li>
<li>BalanceGross</li>
<li>ChgOffPrinGr</li>
<li>GrAppv</li>
<li>SBA_Appv</li>
</ul>

---

## Missing Values/Converting Variables

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Checked for missing Mis_Status values</li>
<li>Converted DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, and SBA_Appv</li>
</ul>

---

## Variable Rework

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>6 regional categories, plus an Unknown group</li>
<li>Makes regional patterns visible</li>
<li>Easier to visualize</li>
</ul>

```
Mid-Atlantic      Midwest    Northeast    Southeast    Southwest      Unknown
      142486       211424       133479       138304        85461         5733
        West
      182277
```

---

## Default Rates/Discretize GrAppv

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Calculated default rates of SBA-backed loans</li>
<li>Allows comparison across regions</li>
<li>Easier to interpret</li>
</ul>

<table>
<thead>
<tr> <th style="text-align:left;"> BankRegion </th> <th style="text-align:right;"> Total_Loans </th> <th style="text-align:right;"> Defaults </th> <th style="text-align:right;"> Default_Rate </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Mid-Atlantic </td> <td style="text-align:right;"> 142486 </td> <td style="text-align:right;">
29375 </td> <td style="text-align:right;"> 20.616060 </td> </tr>
<tr> <td style="text-align:left;"> Midwest </td> <td style="text-align:right;"> 211424 </td> <td style="text-align:right;"> 33263 </td> <td style="text-align:right;"> 15.732840 </td> </tr>
<tr> <td style="text-align:left;"> Northeast </td> <td style="text-align:right;"> 133479 </td> <td style="text-align:right;"> 19692 </td> <td style="text-align:right;"> 14.752883 </td> </tr>
<tr> <td style="text-align:left;"> Southeast </td> <td style="text-align:right;"> 138304 </td> <td style="text-align:right;"> 30791 </td> <td style="text-align:right;"> 22.263275 </td> </tr>
<tr> <td style="text-align:left;"> Southwest </td> <td style="text-align:right;"> 85461 </td> <td style="text-align:right;"> 10167 </td> <td style="text-align:right;"> 11.896655 </td> </tr>
<tr> <td style="text-align:left;"> Unknown </td> <td style="text-align:right;"> 5733 </td> <td style="text-align:right;"> 443 </td> <td style="text-align:right;"> 7.727193 </td> </tr>
<tr> <td style="text-align:left;"> West </td> <td style="text-align:right;"> 182277 </td> <td style="text-align:right;"> 33827 </td> <td style="text-align:right;"> 18.558019 </td> </tr>
</tbody>
</table>

---

## Default Rates/Discretize GrAppv

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Easier to interpret</li>
<li>Roughly evenly distributed</li>
<li>GrAppv split into 5 groups</li>
</ul>

```
 Very Low       Low    Medium      High Very High
   180997    179102    179575    179677    179813
```

---

## Study Population/Sample Calculations

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Removed the Unknown group</li>
<li>Made calculations from the samples</li>
</ul>

<table>
<thead>
<tr> <th style="text-align:right;"> Mid-Atlantic </th> <th style="text-align:right;"> Midwest </th> <th style="text-align:right;"> Northeast </th> <th style="text-align:right;"> Southeast </th> <th style="text-align:right;"> Southwest </th> <th style="text-align:right;"> West </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;">
142486 </td> <td style="text-align:right;"> 211424 </td> <td style="text-align:right;"> 133479 </td> <td style="text-align:right;"> 138304 </td> <td style="text-align:right;"> 85461 </td> <td style="text-align:right;"> 182277 </td> </tr>
</tbody>
</table>

---

## Simple Random Sample

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Performed a simple random sample</li>
<li>Assigned a unique index to each observation</li>
<li>Randomly selected 32,724 observations</li>
</ul>

<table>
<thead>
<tr> <th style="text-align:right;"> Size </th> <th style="text-align:right;"> Var.count </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 32724 </td> <td style="text-align:right;"> 30 </td> </tr>
</tbody>
</table>

---

## Systematic Sampling

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Samples at fixed intervals after a given starting point</li>
<li>Ensures representation throughout the dataset</li>
<li>Reduces bias</li>
</ul>

<table>
<thead>
<tr> <th style="text-align:right;"> Size </th> <th style="text-align:right;"> Var.count </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 33090 </td> <td style="text-align:right;"> 30 </td> </tr>
</tbody>
</table>

---

## Stratified Sampling

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Ensures each region is represented in proportion to its size</li>
<li>Improves accuracy</li>
<li>Reduces bias</li>
</ul>

<table>
<thead>
<tr> <th style="text-align:right;"> Mid-Atlantic </th> <th style="text-align:right;"> Midwest </th> <th style="text-align:right;"> Northeast </th> <th style="text-align:right;"> Southeast </th> <th style="text-align:right;"> Southwest </th> <th style="text-align:right;"> West </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 5219 </td> <td style="text-align:right;"> 7744 </td> <td style="text-align:right;"> 4889 </td> <td style="text-align:right;"> 5066 </td> <td style="text-align:right;"> 3130 </td> <td style="text-align:right;"> 6676 </td> </tr>
</tbody>
</table>

---

## Cluster Sampling
<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Risk of cluster selection bias</li>
<li>Potential over- or under-representation of the population</li>
<li>Enhances comparability between groups</li>
</ul>

<table>
<thead>
<tr> <th style="text-align:right;"> Size </th> <th style="text-align:right;"> Var.count </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 32305 </td> <td style="text-align:right;"> 29 </td> </tr>
</tbody>
</table>

---

<div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;">
Comparison of Methods
</div>

---

## Population-level Default Rates

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Population default rates will be used for comparison</li>
<li>Rates for the known regions range from 11.9% to 22.3%</li>
</ul>

<table>
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> no.lab </th> <th style="text-align:right;"> default </th> <th style="text-align:right;"> no.default </th> <th style="text-align:right;"> default.rate </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Mid-Atlantic </td> <td style="text-align:right;"> 141 </td> <td style="text-align:right;"> 29375 </td> <td style="text-align:right;"> 112970 </td> <td style="text-align:right;"> 20.6 </td> </tr>
<tr> <td style="text-align:left;"> Midwest </td> <td style="text-align:right;"> 143 </td> <td style="text-align:right;"> 33263 </td> <td style="text-align:right;"> 178018 </td> <td style="text-align:right;"> 15.7 </td> </tr>
<tr> <td style="text-align:left;"> Northeast </td> <td style="text-align:right;"> 1396 </td> <td style="text-align:right;"> 19692 </td> <td style="text-align:right;"> 112391 </td> <td style="text-align:right;"> 14.9 </td> </tr>
<tr> <td style="text-align:left;"> Southeast </td> <td style="text-align:right;"> 99 </td> <td style="text-align:right;"> 30791 </td> <td style="text-align:right;"> 107414 </td> <td style="text-align:right;"> 22.3 </td> </tr>
<tr> <td style="text-align:left;">
Southwest </td> <td style="text-align:right;"> 57 </td> <td style="text-align:right;"> 10167 </td> <td style="text-align:right;"> 75237 </td> <td style="text-align:right;"> 11.9 </td> </tr>
<tr> <td style="text-align:left;"> Unknown </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> 443 </td> <td style="text-align:right;"> 5236 </td> <td style="text-align:right;"> 7.8 </td> </tr>
<tr> <td style="text-align:left;"> West </td> <td style="text-align:right;"> 107 </td> <td style="text-align:right;"> 33827 </td> <td style="text-align:right;"> 148343 </td> <td style="text-align:right;"> 18.6 </td> </tr>
</tbody>
</table>

---

## Region-specific Default Rates for Differing Samples

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Little variation between methods; cluster sampling deviates most (up to 2.0 percentage points)</li>
<li>Almost equal to the population default rates</li>
</ul>

<table>
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> default.rate.pop </th> <th style="text-align:right;"> default.rate.srs </th> <th style="text-align:right;"> default.rate.sys </th> <th style="text-align:right;"> default.rate.str </th> <th style="text-align:right;"> default.rate.cluster </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Mid-Atlantic </td> <td style="text-align:right;"> 20.6 </td> <td style="text-align:right;"> 19.9 </td> <td style="text-align:right;"> 21.2 </td> <td style="text-align:right;"> 20.4 </td> <td style="text-align:right;"> 21.7 </td> </tr>
<tr> <td style="text-align:left;"> Midwest </td> <td style="text-align:right;"> 15.7 </td> <td style="text-align:right;"> 16.0 </td> <td style="text-align:right;"> 15.8 </td> <td style="text-align:right;"> 15.3 </td> <td style="text-align:right;"> 15.1 </td> </tr>
<tr> <td style="text-align:left;"> Northeast </td> <td style="text-align:right;"> 14.9 </td> <td style="text-align:right;"> 15.1 </td> <td style="text-align:right;"> 14.7 </td> <td style="text-align:right;"> 14.4 </td> <td style="text-align:right;"> 15.6 </td> </tr>
<tr> <td style="text-align:left;"> Southeast </td> <td style="text-align:right;"> 22.3 </td> <td style="text-align:right;"> 22.1 </td> <td style="text-align:right;"> 22.3 </td> <td style="text-align:right;"> 21.6 </td> <td style="text-align:right;"> 21.5 </td> </tr>
<tr> <td style="text-align:left;"> Southwest </td> <td style="text-align:right;"> 11.9 </td> <td style="text-align:right;"> 11.3 </td> <td style="text-align:right;"> 11.8 </td> <td style="text-align:right;"> 11.8 </td> <td style="text-align:right;"> 12.9 </td> </tr>
<tr> <td style="text-align:left;"> West </td> <td style="text-align:right;"> 18.6 </td> <td style="text-align:right;"> 17.6 </td> <td style="text-align:right;"> 18.5 </td> <td style="text-align:right;"> 18.6 </td> <td style="text-align:right;"> 20.6 </td> </tr>
</tbody>
</table>

---

## Visualization

<img src="Final-Draft-Presentation-2_files/figure-html/unnamed-chunk-20-1.png" width="100%" />

---

<div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;">
Conclusion
</div>

---

## Conclusion

<ul style="font-size: 2.0em; line-height: 1.8;">
<li> Stratified and systematic sampling were the most accurate </li>
<li> Representative selection across methods </li>
<li> Default rates from the different sampling plans are similar to the population default rates </li>
<li> The large sample size likely reduces random variation </li>
</ul>

---

## Limitations

<ul style="font-size: 2.0em; line-height: 1.8;">
<li> Economic factors differ by region </li>
<li> Major economic shifts over the 30 years covered by the dataset </li>
<li> Densely populated areas could inflate rates for the whole region </li>
</ul>

---

<div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;">
Thank You
</div>

---

## Contributors

<ul style="font-size: 1.8em; line-height: 1.8;">
<li> Ryan Lebo - Slides Beginning to Simple Random Sampling </li>
<li> Tyler Battaglini - Slides
Systematic Sampling to Conclusion </li> </ul>
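
---

## Appendix: Sample Size Sketch

The stratified counts and the systematic sample size reported earlier can be checked with a short calculation. The following is a minimal Python sketch, not the authors' code. It assumes proportional allocation across the six known regions at the simple random sample size of 32,724, and a systematic interval of floor(N / n). The function names are hypothetical.

```python
# Region counts come from the Study Population slide; the sample size comes
# from the Simple Random Sample slide. Everything else is illustrative.
REGION_COUNTS = {
    "Mid-Atlantic": 142486,
    "Midwest": 211424,
    "Northeast": 133479,
    "Southeast": 138304,
    "Southwest": 85461,
    "West": 182277,
}
SRS_SIZE = 32724


def proportional_allocation(counts, n):
    """Allocate n sampled units to strata in proportion to stratum size."""
    total = sum(counts.values())
    return {region: round(size * n / total) for region, size in counts.items()}


def systematic_sample_size(population, n, start):
    """Count the units selected when taking every k-th unit from a 0-based start."""
    k = population // n  # assumed sampling interval: floor(N / n)
    return len(range(start, population, k))


population = sum(REGION_COUNTS.values())
allocation = proportional_allocation(REGION_COUNTS, SRS_SIZE)

print(population)
print(allocation)
print(systematic_sample_size(population, SRS_SIZE, start=5))
```

Under these assumptions the allocation reproduces the stratified counts shown on the Stratified Sampling slide, and an interval of 27 yields 33,090 units for any starting position other than the first record, which matches the systematic sample size.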