class: center, middle, inverse, title-slide

.title[
# Bank Loan
]
.subtitle[
## Sampling Strategies
]
.author[
### Tyler Battaglini & Ryan Lebo
]
.date[
### 2025-03-24
]

---

<h1 align="center"> Table of Contents</h1>
<BR>

.pull-left[
- Introduction
- Variables
- Missing Variables/Converting Variables
- Variable Rework
- Default Rates/Discretize GrAppv
- Study Population/Sample Calculations
- Sample Methods
- Comparison of Methods
- Conclusion
]

---

<h1 align = "center"> Introduction <font color="orange"></font></h1>
<BR>

.pull-left[
- Exploratory data analysis (EDA) of the Bank Loan data set
- Data set provides ()
]

---

<h1 align = "center"> Variables <font color="orange"></font></h1>
<BR>

- Mis_Status
- DisbursementGross
- BalanceGross
- ChgOffPrinGr
- GrAppv
- SBA_Appv

---

<h1 align = "center"> Missing Variables/Converting Variables <font color="orange"></font></h1>
<BR>

- Checked for missing Mis_Status values
- Converted DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, and SBA_Appv

---

## Variable Rework

- Reworked into 4 regional categories (plus Unknown)
- Makes regional patterns visible
- Easier to visualize

```
  Midwest Northeast     South   Unknown      West
   269885    150560    275751      1730    201238
```

---

<h1 align = "center"> Default Rates/Discretize GrAppv <font color="orange"></font></h1>
<BR>

- Calculated default rates of SBA-backed loans
- Compared rates across regions
- Discretized GrAppv into five levels that are easier to interpret
- Levels are roughly evenly distributed

```
# A tibble: 5 × 4
  BankRegion Total_Loans Defaults Default_Rate
  <chr>            <int>    <int>        <dbl>
1 Midwest         269885    42537        15.8
2 Northeast       150560    21021        14.0
3 South           275751    58751        21.3
4 Unknown           1730      106         6.13
5 West            201238    35143        17.5
```

```
 Very Low       Low    Medium      High Very High
   180997    179102    179575    179677    179813
```

---

<h1 align = "center"> Study Population/Sample Calculations <font color="orange"></font></h1>
<BR>

- Removed the Unknown group
- Made calculations from the sample

```
# A tibble: 5 × 4
  BankRegion Total_Loans Defaults Default_Rate
  <chr>            <int>    <int>        <dbl>
1 Midwest         269885    42537        15.8
2 Northeast       150560    21021        14.0
3 South           275751    58751        21.3
4 Unknown           1730      106         6.13
5 West            201238    35143        17.5
```

```
 Very Low       Low    Medium      High Very High
   180997    179102    179575    179677    179813
```

---

## Simple Random Sample

- Performed a simple random sample
- Assigned a unique index to each observation
- Randomly selected 32,724 observations

---

## Systematic Sampling

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Samples at fixed intervals after a given starting point </li>
<li>Ensures representation throughout the dataset </li>
<li>Reduces bias </li>
</ul>

|  Size| Var.count|
|-----:|---------:|
| 33239|        30|

---

## Stratified Sampling

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Ensures each region is represented in proportion to its size </li>
<li>Improves accuracy </li>
<li>Reduces bias </li>
</ul>

| Midwest| Northeast| South| West|
|-------:|---------:|-----:|----:|
|    9841|      5490| 10055| 7338|

---

## SBA_Appv Curves

<img src="Bank-Loan-Draft-1_files/figure-html/unnamed-chunk-11-1.png" width="100%" />

---

<div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;">
Comparison of Methods
</div>

---

## Population-level Default Rates

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Population default rates will be used for comparison </li>
<li>Rates are relatively close to one another </li>
</ul>

|          | no.lab| default| no.default| default.rate|
|:---------|------:|-------:|----------:|------------:|
|Midwest   |    174|   42537|     227174|         15.8|
|Northeast |   1427|   21021|     128112|         14.1|
|South     |    227|   58751|     216773|         21.3|
|Unknown   |     54|     106|       1570|          6.3|
|West      |    115|   35143|     165980|         17.5|

---

## Region-Specific Default Rates based on SRS

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Little to no variation from the population rates </li>
<li>South's rate increased; the other three decreased </li>
</ul>

|          | default.rate.pop| default.rate.srs|
|:---------|----------------:|----------------:|
|Midwest   |             15.8|             15.7|
|Northeast |             14.1|             13.2|
|South     |             21.3|             21.6|
|West      |             17.5|             16.7|

---

## Region-Specific Default Rates - Systematic Sample

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Little to no variation between methods </li>
<li>Three of the four rates fall slightly below the population rates </li>
</ul>

|          | default.rate.pop| default.rate.srs| default.rate.sys|
|:---------|----------------:|----------------:|----------------:|
|Midwest   |             15.8|             15.7|             15.6|
|Northeast |             14.1|             13.2|             13.9|
|South     |             21.3|             21.6|             21.6|
|West      |             17.5|             16.7|             17.0|

---

## Region-Specific Default Rates - Stratified Sample

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Again, little to no variation between methods </li>
<li>Almost equal to the population default rates </li>
</ul>

|          | default.rate.pop| default.rate.srs| default.rate.sys| default.rate.str|
|:---------|----------------:|----------------:|----------------:|----------------:|
|Midwest   |             15.8|             15.7|             15.6|             16.7|
|Northeast |             14.1|             13.2|             13.9|             13.6|
|South     |             21.3|             21.6|             21.6|             20.8|
|West      |             17.5|             16.7|             17.0|             17.5|

---

## Visualization

<img src="Bank-Loan-Draft-1_files/figure-html/unnamed-chunk-16-1.png" width="100%" />

---

<div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;">
Conclusion
</div>

---

## Conclusion

<ul style="font-size: 1.6em; line-height: 1.6;">
<li>Default rates from the different sampling plans are similar to the population default rates </li>
<li>Systematic sample slightly underestimates most regions </li>
<li>Stratified performs the best regarding accuracy </li>
<li>South consistently has the highest default rate across sampling methods </li>
</ul>

---

## Limitations

<ul style="font-size: 1.8em; line-height: 1.8;">
<li>Economic factors differ by region </li>
<li>Regions that contain larger or more numerous states could skew that region's data </li>
<li>Distribution of wealth differs between regions </li>
</ul>

---

## Contributors

<ul style="font-size: 1.8em; line-height: 1.8;">
<li>Ryan Lebo - Slides Beginning to Simple Random Sampling </li>
<li>Tyler Battaglini - Slides Systematic Sampling to Conclusion </li>
</ul>
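---

## Appendix: Sampling Sketch

The three sampling plans compared in this deck (simple random, systematic, stratified) can be illustrated with a short, self-contained sketch. This is not the code behind the slides: it is written in Python with only the standard library, every function name is hypothetical, and the toy population is synthetic, with region sizes and default probabilities chosen only to loosely resemble the population table. The systematic interval is assumed to be floor(N / n), which generally yields slightly more than n rows; that is consistent with the 33,239 reported on the Systematic Sampling slide, though the deck does not state how its interval was chosen.

```python
import random
from collections import Counter, defaultdict


def simple_random_sample(rows, n, seed=0):
    """Draw n rows without replacement; every row is equally likely."""
    return random.Random(seed).sample(rows, n)


def systematic_sample(rows, n, seed=0):
    """Take every k-th row after a random start, with k = floor(N / n) (assumed)."""
    k = max(len(rows) // n, 1)
    start = random.Random(seed).randrange(k)
    return rows[start::k]


def stratified_sample(rows, n, key, seed=0):
    """Proportional allocation: each stratum is sampled at the rate n / N."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for name in sorted(strata):
        group = strata[name]
        size = round(n * len(group) / len(rows))
        sample.extend(rng.sample(group, size))
    return sample


def default_rate_by_region(rows):
    """Percentage of defaulted loans within each region, to one decimal."""
    totals, defaults = Counter(), Counter()
    for row in rows:
        totals[row["region"]] += 1
        defaults[row["region"]] += row["default"]
    return {r: round(100 * defaults[r] / totals[r], 1) for r in totals}


def make_toy_population(seed=1):
    """Synthetic loans (NOT the real data): region -> (count, default probability)."""
    rng = random.Random(seed)
    spec = {
        "Midwest": (2700, 0.158),
        "Northeast": (1500, 0.141),
        "South": (2750, 0.213),
        "West": (2000, 0.175),
    }
    rows = [
        {"region": region, "default": int(rng.random() < p)}
        for region, (count, p) in spec.items()
        for _ in range(count)
    ]
    rng.shuffle(rows)
    return rows


population = make_toy_population()
samples = {
    "pop": population,
    "srs": simple_random_sample(population, 330),
    "sys": systematic_sample(population, 330),
    "str": stratified_sample(population, 330, key="region"),
}
for label, rows in samples.items():
    print(label, len(rows), default_rate_by_region(rows))
```

Seeding each sampler keeps the comparison reproducible; changing the seeds shows how much the sample rates move around the population rates from draw to draw.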