class: center, middle, inverse, title-slide

.title[
# Random Sampling for U.S. Bank Loans
]
.subtitle[
##
]
.author[
### Josie Gallop, Chloe Winters, Ava Destefano
]
.date[
### 2025-03-29
]

---

# Agenda

<font size = 5>

.pull-left[
- Introduction
- Variables
- Practical Questions
- Stratification Variable
- Random Samples
- Calculating Loan Default Rate
- Visual Representation
- Results
- Conclusion and Recommendations
]

<BR> <BR>
</font>

---

# Introduction

<font size = 6>

.pull-left[
- Data collected from the U.S. Small Business Administration.
- Collected from 1987-2014.
- 899,164 observations of 27 variables.
- Each observation is a loan.
]

<BR> <BR>
</font>

---

## Variables

<font size = 5>

.pull-left[
- LoanNr_ChkDgt
- Name
- City
- State
- Zip
- Bank
- BankState
- NAICS
- ApprovalDate
- ApprovalFY
- Term
- NoEmp
- NewExist
- CreateJob
]

.pull-right[
- RetainedJob
- FranchiseCode
- UrbanRural
- RevLineCr
- LowDoc
- ChgOffDate
- DisbursementDate
- DisbursementGross
- BalanceGross
- MIS_Status
- ChgOffPrinGr
- GrAppv
- SBA_Appv
]

<BR> <BR>
</font>

---

## Practical Questions

<font size = 6>

• Which type of sampling plan will perform the best? <BR> <BR>
• How can our stratification variable positively impact the quality of our analysis?
<BR> <BR>
</font>

---

## Data Download and Cleaning

<font size = 6>

• The data originally came as nine sets of about 100,000 observations each <BR> <BR>
• We combined them into one data set <BR> <BR>
• Deleting observations where MIS_Status is missing leaves us with 899,023 observations <BR> <BR>
• We created clusters for use in cluster sampling later <BR>
</font>

---

## Stratification Variable

<font size = 6>

• Each state is grouped into one of 5 regions in the United States <BR> <BR>
• Midwest, Northeast, South, Southeast, and West <BR> <BR>
• This new stratification variable, called 'Region', is added to our data set <BR> <BR>
</font>

| Midwest| Northeast|  South| Southeast|   West|
|-------:|---------:|------:|---------:|------:|
|  202538|    204001| 112640|    116761| 263069|

<!-- Start of Josie's Slides -->

---

# Loan Default Rates

<font size = 5>

• We calculate the loan default rate (in %) for each of the 5 regions <BR> <BR>
• Midwest has the lowest default rate <BR> <BR>
• Southeast has the highest default rate <BR> <BR>
</font>

|          | no.lab| default| no.default| default.rate|
|:---------|------:|-------:|----------:|------------:|
|Midwest   |    398|   32063|     170077|         15.9|
|Northeast |   1205|   33013|     169783|         16.3|
|South     |    128|   21020|      91492|         18.7|
|Southeast |     67|   26206|      90488|         22.5|
|West      |    196|   45239|     217634|         17.2|

---

# Simple Random Sampling

<font size = 6>

• We begin with a simple random sample (SRS) <BR> <BR>
• The sample has 3,000 observations of 29 variables <BR> <BR>
</font>

---

# SRS Default Rates

<font size = 6>

• Overall, the SRS default rates are close to the population rates <BR> <BR>
• Every region's SRS rate is within about 1.2 percentage points of its population rate <BR> <BR>
</font>

Table: Comparison of Region-specific default rates between population and the SRS.

|          | default.rate.pop| default.rate.srs|
|:---------|----------------:|----------------:|
|Midwest   |             15.9|             16.4|
|Northeast |             16.3|             15.1|
|South     |             18.7|             18.5|
|Southeast |             22.5|             23.4|
|West      |             17.2|             18.4|

---

# Systematic Random Sample

<font size = 6>

• Next, we take a systematic random sample <BR> <BR>
• Once again, we target a sample size of 3,000 <BR> <BR>
• The actual sample size is 3,007 because the jump size is rounded down to a whole number <BR>
</font>

---

# Systematic Default Rates

<font size = 6>

• As with the SRS, the default rates are close to the population <BR> <BR>
• The largest gap is in the South (16.1 vs. 18.7) <BR> <BR>
</font>

Table: Comparison of Region-specific default rates between population, SRS, and Systematic Sample.

|          | default.rate.pop| default.rate.srs| default.rate.sys|
|:---------|----------------:|----------------:|----------------:|
|Midwest   |             15.9|             16.4|             15.8|
|Northeast |             16.3|             15.1|             15.1|
|South     |             18.7|             18.5|             16.1|
|Southeast |             22.5|             23.4|             21.4|
|West      |             17.2|             18.4|             15.9|

---

# Stratified Random Sample

<font size = 6>

• We take a stratified random sample based on the Region variable, with stratum sample sizes proportional to region size <BR> <BR>
• Midwest: stratum sample size of 676 <BR> <BR>
• Northeast: stratum sample size of 681 <BR> <BR>
• South: stratum sample size of 376 <BR> <BR>
• Southeast: stratum sample size of 390 <BR> <BR>
• West: stratum sample size of 878 <BR> <BR>
</font>

---

# Cluster Sample

<font size = 6>

• Lastly, we take a cluster sample <BR> <BR>
• The clusters are based on ZIP code <BR> <BR>
• Default rate = total defaults / total loans <BR> <BR>
</font>

---

# Stratified and Cluster Default Rates

<font size = 6>

• The stratified sample default rates are very close to the population <BR> <BR>
• The cluster default rates are not close to the population <BR> <BR>
</font>

Table: Comparison of Region-specific default rates between Population, SRS, Systematic Sample, Stratified Sample, and Cluster Sample.

|          | default.rate.pop| default.rate.srs| default.rate.sys| default.rate.str| default.rate.cluster|
|:---------|----------------:|----------------:|----------------:|----------------:|--------------------:|
|Midwest   |             15.9|             16.4|             15.8|             18.1|                 25.0|
|Northeast |             16.3|             15.1|             15.1|             15.5|                 10.2|
|South     |             18.7|             18.5|             16.1|             18.4|                 20.6|
|Southeast |             22.5|             23.4|             21.4|             25.6|                 13.3|
|West      |             17.2|             18.4|             15.9|             17.2|                 16.9|

---

# Discussion of Default Rates

<font size = 6>

• The cluster sample was furthest from the population <BR> <BR>
• Stratified was the closest to the population <BR> <BR>
• SRS and systematic were close but not as close as stratified <BR> <BR>
• Stratified seems like the best process <BR> <BR>
</font>

<!-- Start of Chloe's Slides -->

---
class:inverse middle center
name:Visual

# Visual Representation

---

.pull-center[
[Figure: population default rates compared with sample default rates]
]

---

# Visual Representation Discussion

<font size = 6>

• The graph compares the population default rates to the sample default rates <BR> <BR>
• Cluster seems to be the most different from the population <BR> <BR>
• Stratified seems to be the most similar to the population <BR> <BR>
• While closer than Cluster, Simple and Systematic are not as similar to the population as Stratified <BR> <BR>
</font>

---
class:inverse middle center
name:general

# Results

---

# Average Difference From Population

<font size = 6>

• Simple Random Sample: 1.14 <BR> <BR>
• Systematic Sample: 1.24 <BR> <BR>
• Stratified Sample: 0.7 <BR> <BR>
• Cluster Sample: 5.32 <BR> <BR>
</font>

---

# Average Loan Default Rate

<font size=6>

.pull-left[
**Regions With Cluster Sample**
- Midwest: 17.78
- Northeast: 14.78
- South: 19.32
- Southeast: 21
- West: 17.2
]

.pull-right[
**Regions Without Cluster Sample**
- Midwest: 15.975
- Northeast: 15.925
- South: 19
- Southeast: 22.925
- West: 17.275
]

</font>

---

# Final Results

<font size = 6>

• The stratified sample appears to be the closest to the population. <BR> <BR>
• This is based on comparing the numerical rates and the graph.
<BR> <BR>
• The other samples do have their benefits. <BR> <BR>
</font>

---
class:inverse middle center
name:general

# Conclusion & Recommendations

---

# Conclusion & Recommendations

<font size = 6>

• Stratified is the best sampling plan based on our analysis <BR> <BR>
• However, depending on the goals and resources, other sampling plans may be preferred <BR> <BR>
• Simple and Systematic were still very close to the population <BR> <BR>
• Because Cluster differed the most from the population, we recommend avoiding it <BR> <BR>
</font>

---
class: inverse center middle

# Q & A

---
name: Thank you
class: inverse center middle

# Thank you!

Slides created using R packages:

[**xaringan**](https://github.com/yihui/xaringan)<br>
[**gadenbuie/xaringanthemer**](https://github.com/gadenbuie/xaringanthemer)<br>
[**knitr**](http://yihui.name/knitr)<br>
[**R Markdown**](https://rmarkdown.rstudio.com)<br>
via <br>
[**RStudio Desktop**](https://posit.co/download/rstudio-desktop/)

---
class: inverse center middle

# Slides

<font size = 6>

• Ava: 1-7 <BR> <BR>
• Josie: 8-16, 28 <BR> <BR>
• Chloe: 17-26 <BR> <BR>
</font>
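
---

# Appendix: Checking the Numbers

The following is a minimal Python sketch, not the authors' R code, of three calculations behind the earlier slides: the region default rates, the stratum sample sizes, and the systematic sample size. All function names are hypothetical. Two rules are assumptions rather than something the slides state: strata are allocated in proportion to region size, and the systematic jump size is 899,023 / 3,000 rounded down. The default rate is taken as defaults divided by labelled loans, which matches the rates in the Loan Default Rates table.

```python
import random

# Counts copied from the slides' tables.
REGION_COUNTS = {
    "Midwest": 202538, "Northeast": 204001, "South": 112640,
    "Southeast": 116761, "West": 263069,
}
# (default, no.default) per region; loans with no label are left out.
DEFAULT_TALLIES = {
    "Midwest": (32063, 170077), "Northeast": (33013, 169783),
    "South": (21020, 91492), "Southeast": (26206, 90488),
    "West": (45239, 217634),
}


def default_rate(defaults, non_defaults):
    """Defaults as a percentage of labelled loans, to one decimal place."""
    return round(100 * defaults / (defaults + non_defaults), 1)


def proportional_allocation(counts, n):
    """Assumed rule: each stratum gets n times its population share, rounded."""
    total = sum(counts.values())
    return {region: round(n * size / total) for region, size in counts.items()}


def systematic_indices(population_size, n, rng):
    """Assumed rule: jump = floor(N / n), random start within the first jump."""
    jump = population_size // n
    start = rng.randrange(jump)
    return list(range(start, population_size, jump))


rates = {region: default_rate(*tally) for region, tally in DEFAULT_TALLIES.items()}
allocation = proportional_allocation(REGION_COUNTS, 3000)
systematic = systematic_indices(899_023, 3000, random.Random(2025))

print(rates)
print(allocation, sum(allocation.values()))
print(len(systematic))
```

With the counts above, the allocation matches the stratum sizes on the Stratified Random Sample slide and sums to 3,001 rather than 3,000, because each stratum is rounded separately. The systematic sample has 3,006 or 3,007 loans depending on the random start.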