class: center, middle, inverse, title-slide

.title[
# Random Sampling for U.S. Bank Loans
]
.subtitle[
## 
]
.author[
### Josie Gallop, Chloe Winters, Ava Destefano
]
.date[
### 2025-03-24
]

---
# Agenda

<font size = 5>

.pull-left[
- Introduction
- Variables
- Practical Questions
- Stratification Variable
- Random Samples
- Calculating Loan Default Rate
- Visual Representation
- Results
- Conclusion and Recommendations
]

<BR> <BR>
</font>

---
# Introduction

<font size = 6>

.pull-left[
- Data collected from the U.S. Small Business Administration.
- Collected from 1987-2014.
- 899,164 observations of 27 variables.
- Each observation is a loan.
- Loan default status is our reference variable.
]

<BR> <BR>
</font>

---
## Variables

<font size = 5>

.pull-left[
- LoanNr_ChkDgt
- Name
- City
- State
- Zip
- Bank
- BankState
- NAICS
- ApprovalDate
- ApprovalFY
- Term
- NoEmp
- NewExist
- CreateJob
]

.pull-right[
- RetainedJob
- FranchiseCode
- UrbanRural
- RevLineCr
- LowDoc
- ChgOffDate
- DisbursementDate
- DisbursementGross
- BalanceGross
- MIS_Status
- ChgOffPrinGr
- GrAppv
- SBA_Appv
]

<BR> <BR>
</font>

---
## Practical Questions

<font size = 6>

• How can our stratification variable improve the quality of our analysis?
<BR> <BR>
• How will our stratification variable impact loan default rates? <BR> <BR>

</font>

---
## Data Download and Cleaning

<font size = 6>

• The data originally comes as nine sets of about 100,000 observations each <BR> <BR>
• We combine them into one data set <BR> <BR>
• Deleting observations where MIS_Status is missing leaves us with 899,023 observations <BR> <BR>

</font>

---
## Stratification Variable

<font size = 6>

• Each state will be grouped into one of 5 regions in the United States <BR> <BR>
• Midwest, Northeast, South, Southeast, and West <BR> <BR>
• This new stratification variable, called 'Region', will be added to our data set <BR> <BR>

</font>

| Midwest| Northeast|  South| Southeast|   West|
|-------:|---------:|------:|---------:|------:|
|  202538|    204001| 112640|    116761| 263069|

<!-- Start of Josie's Slides -->

---
# Loan Default Rates

<font size = 6>

• We will calculate the loan default rates (in %) for the 5 regions <BR> <BR>
• Midwest: default rate of 15.9 <BR> <BR>
• Northeast: default rate of 16.3 <BR> <BR>
• South: default rate of 18.7 <BR> <BR>
• Southeast: default rate of 22.5 <BR> <BR>
• West: default rate of 17.2 <BR> <BR>

</font>

---
# Simple Random Sampling

<font size = 6>

• We will begin with a simple random sample <BR> <BR>
• We will use a sample size of 3,000 <BR> <BR>
• The resulting sample has 3,000 observations and 29 variables <BR> <BR>

</font>

---
# Simple Random Sampling Default Rates

<font size = 6>

• Midwest: 15.9 (population) and 16.0 (SRS) <BR> <BR>
• Northeast: 16.3 (population) and 15.8 (SRS) <BR> <BR>
• South: 18.7 (population) and 17.4 (SRS) <BR> <BR>
• Southeast: 22.5 (population) and 26.0 (SRS) <BR> <BR>
• West: 17.2 (population) and 16.2 (SRS) <BR> <BR>

</font>

---
# Systematic Random Sample

<font size = 6>

• We will next take a systematic random sample <BR> <BR>
• Once again, we will use a sample size of 3,000 <BR> <BR>
• We actually have a size of 3,007 because the jump size must be rounded to a whole number <BR>

</font>

---
# Systematic Random Sample Default Rates

<font size = 6>

• Midwest: 15.6 (systematic) <BR> <BR>
• Northeast: 15.9 (systematic) <BR> <BR>
• South: 21.1 (systematic) <BR> <BR>
• Southeast: 23.9 (systematic) <BR> <BR>
• West: 16.7 (systematic) <BR> <BR>

</font>

---
# Stratified Random Sample

<font size = 6>

• We will take a stratified random sample based on the region variable <BR> <BR>
• Midwest: stratum size of 676 <BR> <BR>
• Northeast: stratum size of 681 <BR> <BR>
• South: stratum size of 376 <BR> <BR>
• Southeast: stratum size of 390 <BR> <BR>
• West: stratum size of 878 <BR> <BR>

</font>

---
# Cluster Sample

<font size = 6>

• Lastly, we will take a cluster sample <BR> <BR>
• The clusters will be based on ZIP code <BR> <BR>

</font>

---
# Stratified and Cluster Sample Default Rates

<font size = 6>

• Midwest: 16.3 (stratified) and 25.0 (cluster) <BR> <BR>
• Northeast: 18.7 (stratified) and 10.2 (cluster) <BR> <BR>
• Southeast: 21.5 (stratified) and 20.6 (cluster) <BR> <BR>
• West: 25.1 (stratified) and 13.3 (cluster) <BR> <BR>
• South: 16.6 (stratified) and 16.9 (cluster) <BR> <BR>

</font>

<!-- Start of Chloe's Slides -->

---
class: inverse middle center
name: Visual

# Visual Representation

---

<img src="https://chloewinters79.github.io/STA490/Image/Cluster%20Graph.png" height=70%>

---
# Visual Representation Discussion

<font size = 6>

• Compare population default rates to sample default rates <BR> <BR>
• Cluster seems to be the most different from the population <BR> <BR>
• Stratified seems to be the most similar to the population <BR> <BR>
• While closer than Cluster, Simple and Systematic are not as similar to the population as Stratified <BR> <BR>

</font>

---
class: inverse middle center
name: general

# Results

---
# Results

<font size = 6>

• The stratified sample appears to be closest to the population <BR> <BR>
• This is based on comparing the numerical rates and the graph <BR> <BR>
• The other samples do have their benefits <BR> <BR>

</font>

---
class: inverse middle center
name: general

# Conclusion & Recommendations

---
# Conclusion & Recommendations

<font size = 6>

• Stratified is the best sample based on our analysis <BR> <BR>
• However, depending on the goals and resources, other samples may be preferred <BR> <BR>
• Simple and Systematic were still very close to the population <BR> <BR>
• The graph shows the Midwest has the highest loan default rate <BR> <BR>
• The Northeast appears to have the lowest loan default rate <BR> <BR>

</font>

---
class: inverse center middle

# Q & A

---
name: Thank you
class: inverse center middle

# Thank you!

Slides created using R packages:

[**xaringan**](https://github.com/yihui/xaringan)<br>
[**gadenbuie/xaringanthemer**](https://github.com/gadenbuie/xaringanthemer)<br>
[**knitr**](http://yihui.name/knitr)<br>
[**R Markdown**](https://rmarkdown.rstudio.com)<br>
via <br>
[**RStudio Desktop**](https://posit.co/download/rstudio-desktop/)
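
---
# Appendix: Sample Size Sketch

The stratum sizes on the Stratified Random Sample slide are consistent with proportional allocation of 3,000 draws across the region counts, and the 3,007 rows in the systematic sample are consistent with a whole-number jump size of 299. The sketch below is written in Python for illustration and is not the code used for the analysis. The function names and the floor rounding rule for the jump size are assumptions.

```python
# Region counts copied from the Stratification Variable slide.
REGION_COUNTS = {
    "Midwest": 202538,
    "Northeast": 204001,
    "South": 112640,
    "Southeast": 116761,
    "West": 263069,
}


def proportional_allocation(counts, n):
    """Give each stratum a share of n proportional to its population size."""
    total = sum(counts.values())
    return {name: round(n * size / total) for name, size in counts.items()}


def systematic_indices(population_size, n, start=0):
    """Select every k-th row index, with k = floor(N / n).

    The floor rule is an assumption; the slides only say the jump size
    was rounded.
    """
    jump = population_size // n
    return range(start, population_size, jump)


# Target sample size of 3,000, as on the slides.
strata_sizes = proportional_allocation(REGION_COUNTS, 3000)

# 899,023 is the population size after cleaning (Data Download and Cleaning).
systematic = systematic_indices(899023, 3000)

print(strata_sizes)
print(len(systematic))
```

Because each stratum is rounded separately, the allocated sizes add up to 3,001 rather than exactly 3,000. With a jump size of 299, the systematic sample has 3,007 rows for an early starting index and 3,006 for a later one.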