class: center, middle, inverse, title-slide

.title[
# Sampling Methods with Bank Loan Data
]
.author[
### Gianna LaFrance & Haley Koprivsek
]

---

<h1 align = "center"> Table of Contents </h1>

<center><font size="7">
- Introduction <br>
- Exploratory Data Analysis <br>
- Sampling Process and Default Rates<br>
- Simple Random <br>
- Systematic <br>
- Stratified <br>
- Cluster <br>
- Visualizations<br>
- Conclusion & Discussion
</font></center>

---
name: colors

## Introduction

<font size="5">

- Bank Loan dataset
  - just under 900,000 observations
- Population: loan applications submitted to banks by small businesses with a partial guarantee from the U.S. Small Business Administration (SBA)
- Four different samples of about 4,000 observations taken using four different sampling methods
  - SRS
  - Systematic
  - Stratified
  - Cluster
- Observations divided into regions for analysis & sampling purposes
- Regional loan default rates compared among different sampling methods
- Response: MIS_Status (loan paid off or defaulted)

---
name: colors

## Exploratory Data Analysis

<font size="5">

.pull-left[
- Data split up into nine subsets, then combined back into one master data set
- Observations with a missing loan status (MIS_Status) removed from analytic data set
- Data stratified by newly created Region variable (Northeast, Midwest, South, West), based on borrower's state
- List of all unique ZIP codes generated and split into clusters for sampling purposes later in analysis
- Study population: observations from the original population that had a loan status listed and fell into one of the four U.S.
geographic regions defined by Region variable (896,872 total observations)
]

.pull-right[
<img src="USAmapUpdated.png" width="100%" />
]

---
class: inverse center middle

# Sampling Methods

---
name: colors

## Simple Random Sampling

.pull-left[
- 4,000 observations (target sample size for analysis) chosen at random across entire sampling frame
- Regional default rate = (# of observations from subgroup in sample for which **MIS_Status** = "CHGOFF")/(# of observations from subgroup in sample)
- Expressed as a percentage

<img src="SRS_vis.png" width="550px" height="275px" />
]

.pull-right[
<table>
<caption>Regional Default Rates (SRS)</caption>
<thead>
<tr> <th style="text-align:left;"> Region </th> <th style="text-align:right;"> default_rate </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Midwest </td> <td style="text-align:right;"> 17.54190 </td> </tr>
<tr> <td style="text-align:left;"> Northeast </td> <td style="text-align:right;"> 16.60819 </td> </tr>
<tr> <td style="text-align:left;"> South </td> <td style="text-align:right;"> 20.40073 </td> </tr>
<tr> <td style="text-align:left;"> West </td> <td style="text-align:right;"> 16.92708 </td> </tr>
</tbody>
</table>
]

---
name: colors

## Systematic Random Sampling

.pull-left[
- First observation selected at random, then every *i*-th subsequent observation selected, where *i* is the chosen jump size
- Jump size determined by dividing number of observations in sampling frame by target sample size
- 4,004 observations selected, 4 removed at random to achieve target sample size of 4,000

<img src="systematic_vis.png" width="550px" height="300px" />
]

.pull-right[
<table>
<caption>Regional Default Rates (Systematic)</caption>
<thead>
<tr> <th style="text-align:left;"> Region </th> <th style="text-align:right;"> default_rate </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Midwest </td> <td style="text-align:right;"> 16.89840 </td> </tr>
<tr> <td style="text-align:left;"> Northeast </td> <td
style="text-align:right;"> 16.16628 </td> </tr>
<tr> <td style="text-align:left;"> South </td> <td style="text-align:right;"> 21.55412 </td> </tr>
<tr> <td style="text-align:left;"> West </td> <td style="text-align:right;"> 18.24687 </td> </tr>
</tbody>
</table>
]

---
name: colors

## Stratified Random Sampling

.pull-left[
- Study population stratified into four subpopulations based on **Region** variable
- About 4,000 observations selected by taking random sample from each stratum, maintaining the proportions of each of the subpopulations to the entire study population
- Subpopulation samples combined to create stratified sample (3,999 total observations)

<img src="stratified_sys.png" width="550px" height="300px" />
]

.pull-right[
<table>
<caption>Regional Default Rates (Stratified)</caption>
<thead>
<tr> <th style="text-align:left;"> Region </th> <th style="text-align:right;"> default_rate </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Midwest </td> <td style="text-align:right;"> 14.53940 </td> </tr>
<tr> <td style="text-align:left;"> Northeast </td> <td style="text-align:right;"> 16.66667 </td> </tr>
<tr> <td style="text-align:left;"> South </td> <td style="text-align:right;"> 18.85246 </td> </tr>
<tr> <td style="text-align:left;"> West </td> <td style="text-align:right;"> 16.72355 </td> </tr>
</tbody>
</table>
]

---
name: colors

## Cluster Sampling

.pull-left[
- Random sample taken from list of unique ZIP codes created in EDA phase
- All "clusters" of observations which fell into any of the randomly selected ZIP codes combined to form cluster sample
- 450 clusters (i.e., sampled ZIP codes) randomly selected to obtain a sample close to the target sample size (3,971 total observations)

<img src="cluster_vis.png" width="550px" height="300px" />
]

.pull-right[
<table>
<caption>Regional Default Rates (Cluster)</caption>
<thead>
<tr> <th style="text-align:left;"> Region </th> <th style="text-align:right;"> default_rate </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Midwest </td> <td style="text-align:right;"> 14.87514 </td> </tr>
<tr> <td style="text-align:left;"> Northeast </td> <td style="text-align:right;"> 18.14988 </td> </tr>
<tr> <td style="text-align:left;"> South </td> <td style="text-align:right;"> 20.05420 </td> </tr>
<tr> <td style="text-align:left;"> West </td> <td style="text-align:right;"> 15.84362 </td> </tr>
</tbody>
</table>
]

---
name: colors

## Cluster Sampling
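
The cluster-sampling step and the regional default-rate calculation described above can be sketched roughly as follows. This is a minimal illustration, not the code behind these slides: it runs on a small synthetic population rather than the bank loan data, and the record keys `zip`, `region`, and `mis_status`, the `"PAID"` status label, and both function names are assumptions made for the example. Only the `"CHGOFF"` label comes from the slides.

```python
import random
from collections import defaultdict


def cluster_sample(records, n_clusters, seed=0):
    """Randomly choose n_clusters ZIP codes and keep every record in them."""
    rng = random.Random(seed)
    zips = sorted({r["zip"] for r in records})  # unique ZIP codes = clusters
    chosen = set(rng.sample(zips, n_clusters))
    return [r for r in records if r["zip"] in chosen]


def regional_default_rates(sample):
    """Percent of records per region whose status is "CHGOFF"."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for r in sample:
        totals[r["region"]] += 1
        if r["mis_status"] == "CHGOFF":
            defaults[r["region"]] += 1
    # charged-off count divided by all sampled records in the region
    return {reg: 100 * defaults[reg] / totals[reg] for reg in totals}


# Synthetic toy population (hypothetical): 200 ZIP codes with 5 loans each.
rng = random.Random(1)
regions = ["Midwest", "Northeast", "South", "West"]
population = [
    {
        "zip": z,
        "region": regions[z % 4],
        "mis_status": "CHGOFF" if rng.random() < 0.18 else "PAID",
    }
    for z in range(200)
    for _ in range(5)
]

sample = cluster_sample(population, n_clusters=20)
rates = regional_default_rates(sample)
```

Because whole ZIP codes are kept, the final sample size depends on which clusters are drawn, which is consistent with the cluster sample landing near, rather than exactly on, the 4,000 target.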
---
name: colors

## Population Default Rates

- Default rates calculated for each region based on the entire study population
- Used as a benchmark for the sample estimates

<table>
<caption>Regional Default Rates (Population)</caption>
<thead>
<tr> <th style="text-align:left;"> Region </th> <th style="text-align:right;"> default_rate </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Midwest </td> <td style="text-align:right;"> 15.84033 </td> </tr>
<tr> <td style="text-align:left;"> Northeast </td> <td style="text-align:right;"> 15.93425 </td> </tr>
<tr> <td style="text-align:left;"> South </td> <td style="text-align:right;"> 20.52876 </td> </tr>
<tr> <td style="text-align:left;"> West </td> <td style="text-align:right;"> 17.19110 </td> </tr>
</tbody>
</table>

---
class: inverse center middle

# Visualizations

---
name: middle

## Default Rates

<table>
<caption>Comparison of Regional Default Rates Among Population & Different Samples</caption>
<thead>
<tr> <th style="text-align:left;"> Region </th> <th style="text-align:right;"> Population </th> <th style="text-align:right;"> SRS </th> <th style="text-align:right;"> Systematic </th> <th style="text-align:right;"> Stratified </th> <th style="text-align:right;"> Cluster </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Midwest </td> <td style="text-align:right;"> 15.84033 </td> <td style="text-align:right;"> 17.54190 </td> <td style="text-align:right;"> 16.89840 </td> <td style="text-align:right;"> 14.53940 </td> <td style="text-align:right;"> 14.87514 </td> </tr>
<tr> <td style="text-align:left;"> Northeast </td> <td style="text-align:right;"> 15.93425 </td> <td style="text-align:right;"> 16.60819 </td> <td style="text-align:right;"> 16.16628 </td> <td style="text-align:right;"> 16.66667 </td> <td style="text-align:right;"> 18.14988 </td> </tr>
<tr> <td style="text-align:left;"> South </td> <td style="text-align:right;"> 20.52876 </td> <td style="text-align:right;"> 20.40073 </td> <td
style="text-align:right;"> 21.55412 </td> <td style="text-align:right;"> 18.85246 </td> <td style="text-align:right;"> 20.05420 </td> </tr>
<tr> <td style="text-align:left;"> West </td> <td style="text-align:right;"> 17.19110 </td> <td style="text-align:right;"> 16.92708 </td> <td style="text-align:right;"> 18.24687 </td> <td style="text-align:right;"> 16.72355 </td> <td style="text-align:right;"> 15.84362 </td> </tr>
</tbody>
</table>

---
name: colors

## Bar Charts

- South consistently exhibited the highest default rate across all samples; order of the other three regions depended on the sample
- Similar variation in calculated default rate across samples for each region

<img src="STA490_SamplingPresentationRevisedFinal_files/figure-html/unnamed-chunk-21-1.png" width="100%" />

---
name: colors

## Bar Charts

- Beyond the South having the highest default rate, the samples suggest considerably different regional rankings.
- E.g., SRS suggests Midwest has the 2nd highest default rate (around 17.5%), while the stratified sample suggests it has the lowest (under 15%).
- Overall, the regional default rates from the systematic sample bear the closest resemblance to those of the study population.

<img src="STA490_SamplingPresentationRevisedFinal_files/figure-html/unnamed-chunk-22-1.png" width="100%" />

---
class: inverse center middle

# Conclusion

---
name: colors

## Conclusion

<font size="5.5">

- Systematic sample recommended based on closest resemblance of regional default rates to those of the population (every regional estimate within about 1.1 percentage points)
- Not entirely representative (e.g., estimates for South & West are noticeably higher than population figures)
- Not necessarily an optimal sample; may be advisable to try alternative sampling methods, sample sizes, stratification variables, etc. to generate a more representative sample

---
class: inverse center middle

# Questions?

---
class: colors

## Contributions

- Content ~ Haley
- Slide Style ~ Gianna
- Edit Content/Slides ~ Haley, Gianna
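
---
name: colors

## Appendix: Comparing Samples to the Population

One way to put a number on "closest resemblance" is the largest absolute gap between a sample's regional default rate and the population rate. The sketch below uses the figures from the comparison table; the criterion itself and the name `max_abs_gap` are assumptions chosen for illustration, not necessarily the criterion used for the recommendation, and other criteria (a mean absolute gap, for instance) need not give the same ranking.

```python
# Regional default rates (%) copied from the comparison table.
population = {"Midwest": 15.84033, "Northeast": 15.93425,
              "South": 20.52876, "West": 17.19110}
samples = {
    "SRS":        {"Midwest": 17.54190, "Northeast": 16.60819,
                   "South": 20.40073, "West": 16.92708},
    "Systematic": {"Midwest": 16.89840, "Northeast": 16.16628,
                   "South": 21.55412, "West": 18.24687},
    "Stratified": {"Midwest": 14.53940, "Northeast": 16.66667,
                   "South": 18.85246, "West": 16.72355},
    "Cluster":    {"Midwest": 14.87514, "Northeast": 18.14988,
                   "South": 20.05420, "West": 15.84362},
}


def max_abs_gap(sample_rates, pop_rates):
    """Largest absolute difference (percentage points) over all regions."""
    return max(abs(sample_rates[r] - pop_rates[r]) for r in pop_rates)


# Worst-case regional gap for each sampling method.
gaps = {name: max_abs_gap(rates, population) for name, rates in samples.items()}

# Method whose worst regional gap is smallest under this criterion.
closest = min(gaps, key=gaps.get)
```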