October 25, 2022

Research Question

  • Which sampling method best represents this data set?
    • Simple Random Sampling
    • Systematic Random Sampling
    • Stratified Sampling

The Sampling Methods
  • Simple Random Sampling
    • A subset of a larger population in which every individual is chosen at random, each with the same probability of selection

[Figure: Simple Random Sampling]

  • Systematic Random Sampling
    • A probability sampling method in which observations are selected from a larger population at a fixed periodic interval, beginning from a chosen starting point

[Figure: Systematic Sampling]

  • Stratified Sampling
    • Divides a population into smaller subgroups, known as strata, based on members’ shared attributes or characteristics; each stratum is then randomly sampled independently of the others

[Figure: Stratified Sampling]

The Data
  • 899,164 observations
  • 27 variables
  • Provides information on loans that were guaranteed to some degree by the U.S. Small Business Administration (SBA) between 1987 and 2014
  • The strata are defined by the stratification variable “State”, after the states are grouped into regions
  • “State” was chosen because geographical location may affect other variables, such as default rate

[Figure: US Map]

The Data
  • Checking whether any regional category of the variable “State” has a small number of observations (fewer than 1,000)
Population.size  Number.of.Regions  Sub.Pop.less.1000
899164           7                  0
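
The check above can be sketched as follows. This is a minimal Python illustration, not the R code used for the analysis. The state-to-region lookup is hypothetical (the slides do not list which states fall in which region), the toy data stands in for the “State” column, and the 1,000 threshold is taken from the column name Sub.Pop.less.1000.

```python
from collections import Counter

# Hypothetical state-to-region lookup (illustrative only; the actual
# grouping used in the analysis is not given in the slides).
STATE_TO_REGION = {
    "NY": "MidAtlantic",
    "PA": "MidAtlantic",
    "OH": "Midwest",
    "IL": "Midwest",
    "CA": "West",
}


def small_strata(states, threshold=1000):
    """Count rows per region and list regions with fewer than `threshold` rows."""
    counts = Counter(STATE_TO_REGION[s] for s in states)
    small = sorted(region for region, n in counts.items() if n < threshold)
    return counts, small


# Toy data standing in for the "State" column of the loan data.
toy_states = ["NY"] * 3 + ["PA"] * 2 + ["OH"] * 4 + ["IL"] * 1 + ["CA"] * 2

# A tiny threshold is used here only because the toy data is tiny.
counts, small = small_strata(toy_states, threshold=3)
print(dict(counts))
print("Regions below threshold:", small)
```

An empty list of small regions corresponds to the 0 reported in the table.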

Simple Random Sampling
  • The simple random sample was drawn using R's random number generator
  • The table shows the number of observations and the number of variables in the sample
  • Checking that the number of variables is 5% or less of the number of observations
Size  Var.count
4000  28
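
A simple random sample of this size, together with the 5% check, might look like the sketch below. It is written in Python rather than R, draws row numbers instead of loading the loan data, and the seed value is an arbitrary assumption added for reproducibility.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw `sample_size` distinct 1-based row numbers, each equally likely."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


def variable_ratio_ok(n_obs, n_vars, max_ratio=0.05):
    """True when the variable count is at most `max_ratio` of the observations."""
    return n_vars <= max_ratio * n_obs


# Population and sample sizes come from the slides; the seed is hypothetical.
rows = simple_random_sample(899_164, 4_000, seed=2022)
print(len(rows), variable_ratio_ok(len(rows), 28))  # 4000 True
```

With 4,000 observations, the 5% limit is 200 variables, so 28 variables passes the check comfortably.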

Systematic Random Sampling
  • Using R to step through the observations at a fixed interval, selecting one observation at each step
  • Starting point is observation 107
  • Jump size is 224
  • Table created to show number of observations and variables in the new sample
  • Checking that the number of variables is 5% or less of the number of observations
Size  Var.count
4014  28
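
The sample size of 4,014 follows directly from the starting point and jump size. The Python sketch below reproduces the row numbers, assuming 1-based row numbering as in R; it is an illustration, not the original R code.

```python
def systematic_sample(population_size, start, step):
    """Return 1-based row numbers start, start + step, ... up to population_size."""
    return list(range(start, population_size + 1, step))


# Starting point, jump size, and population size are taken from the slides.
rows = systematic_sample(899_164, start=107, step=224)
print(len(rows), rows[:3], rows[-1])  # 4014 [107, 331, 555] 899019
```

The last selected row is 107 + 224 × 4013 = 899,019; one more jump would pass the end of the data set.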

Stratified Sampling
  • Taking the same number of observations from each regional category, using the frequency function in R
  • The sample sizes are equal across all regions (571 each)
MidAtlantic  Midwest  Northeast  Northwest  Southeast  Southwest  West
571          571      571        571        571        571        571
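
Equal allocation across strata can be sketched as below. This is a Python illustration on toy records, not the R code behind the table. The per-region count of 571 is consistent with splitting a target of 4,000 evenly among 7 regions, but that target is an assumption, as are the seed and the toy record layout.

```python
import random
from collections import defaultdict


def stratified_equal_sample(records, stratum_of, per_stratum, seed=None):
    """Independently draw `per_stratum` records at random from every stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[stratum_of(rec)].append(rec)
    sample = []
    for name in sorted(by_stratum):
        sample.extend(rng.sample(by_stratum[name], per_stratum))
    return sample


REGIONS = ["MidAtlantic", "Midwest", "Northeast", "Northwest",
           "Southeast", "Southwest", "West"]

# Toy records of the form (row id, region); 1,000 rows per region.
toy = [(i, REGIONS[i % 7]) for i in range(7_000)]

# Assumed target of 4,000 split evenly across regions: 4000 // 7 = 571.
per_region = 4_000 // len(REGIONS)
sample = stratified_equal_sample(toy, lambda rec: rec[1], per_region, seed=1)
print(per_region, len(sample))  # 571 3997
```

Equal allocation gives every region the same weight in the sample regardless of its share of the population.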

The Results
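  • The default rate of each sample is compared with that of the population to assess how well the sample represents it
  • The table shows that the systematic sample comes closest to the population
  • The stratified sample deviates substantially from the population
    • More research is needed to explore why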
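
The comparison described above amounts to computing a default rate per sample and measuring its gap from the population rate. The Python sketch below uses made-up 0/1 flags and neutral sample names (A, B, C); none of the numbers are results from the SBA data, and the 0/1 default flag is an assumed encoding.

```python
def default_rate(flags):
    """Share of loans flagged as defaulted (flags are 0 or 1)."""
    return sum(flags) / len(flags)


def rank_by_closeness(population_flags, samples):
    """Sort samples by |sample rate - population rate|, closest first."""
    target = default_rate(population_flags)
    gaps = {name: abs(default_rate(flags) - target)
            for name, flags in samples.items()}
    return sorted(gaps.items(), key=lambda item: item[1])


# Synthetic default flags for illustration only.
population = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # rate 0.20
samples = {
    "A": [1, 0, 0, 0],     # rate 0.25
    "B": [1, 0, 0, 0, 0],  # rate 0.20
    "C": [1, 1, 0, 0],     # rate 0.50
}

ranking = rank_by_closeness(population, samples)
for name, gap in ranking:
    print(f"{name}: gap = {gap:.3f}")
```

The sample listed first has the default rate closest to the population's.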

The Results in a Graph
  • The graph reaffirms systematic sampling as the best fit
  • No lines are drawn for the values that are consistent in the systematic and stratified samples

The Conclusion
  • Three samples were drawn using three different sampling methods, with the data stratified by “State”, and a table was created from the default rates of the samples
  • Together, the table and graph support the idea that the systematic sample best represents the original population
  • With the best-performing sampling method identified, further analysis can explore trends in the bank loan data