1 Introduction

I compare the performance of three random sampling plans (simple random sampling, systematic sampling, and stratified sampling), using a large bank loan data set as the finite population.

The bank loan data set was provided by the U.S. Small Business Administration (SBA) and contains all historical loans endorsed by the SBA from 1987 through 2014. It has 27 variables and 899,164 observations. Each observation represents a loan that was guaranteed to some degree by the SBA. Detailed information about the data set can be found at https://pengdsci.github.io/datasets/LoanData-description.pdf.

The original data set was split into 9 subsets that are stored on GitHub. These subsets need to be loaded into R and combined into a single data set.
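The combining step can be sketched as follows. This is only a minimal illustration, assuming all subsets share the same columns. The three toy data frames and the column name LoanNr are hypothetical stand-ins; with the real files, each element of the list would instead come from read.csv() applied to the location of that subset.

```r
# Hypothetical stand-ins for the subsets stored on GitHub.
subset1 <- data.frame(LoanNr = 1:3, ApprovalFY = c(1997, 1998, 1998))
subset2 <- data.frame(LoanNr = 4:5, ApprovalFY = c(2001, 2003))
subset3 <- data.frame(LoanNr = 6:9, ApprovalFY = c(2005, 2005, 2006, 2010))

# Stack all subsets row-wise into a single data set.
subsets <- list(subset1, subset2, subset3)
loan <- do.call(rbind, subsets)

dim(loan)  # number of rows is the total across subsets; columns are unchanged
```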

Since this is an exploratory data analysis, I compare the three sampling plans visually, using graphs.

In the next few sections, I review the three sampling types and perform some data management tasks to define the study population. The three random samples are then drawn from the study population. I present the comparison using a graphical approach. A discussion and closing remarks are given at the end of this report.

2 Review: Sampling Plans

In statistics, there are numerous ways to sample a population in order to draw inferences about it. Random sampling, however, is widely regarded as the most reliable way to obtain a sample that represents the population. In this analysis, I use three types of random sampling: simple random sampling, systematic sampling, and stratified sampling.

2.1 Simple Random Sampling

Simple random sampling is the most basic random sampling technique and serves as the benchmark for the other plans. When taking a simple random sample (SRS) of size \(n\) from a defined finite population, we assume that every possible combination of \(n\) data points has the same probability of being selected.

For example, let's say we have a population of 10 people in a room and we want to take a random sample of 3 people. Under SRS, the group made up of persons 1, 2, and 3 is exactly as likely to be selected as the group made up of persons 4, 5, and 6, or any other group of 3.
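The room example can be sketched in R as follows. This is a minimal illustration using the base function sample(); the seed value is arbitrary and only makes the draw reproducible.

```r
set.seed(123)                     # arbitrary seed, for reproducibility only

people <- 1:10                    # the 10 people in the room
srs <- sample(people, size = 3)   # draws 3 people without replacement

# Number of distinct groups of 3 out of 10; under SRS each is equally likely.
n_samples <- choose(10, 3)        # 120
prob_each <- 1 / n_samples        # probability of any one particular group
```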

2.2 Systematic Sampling

The key step in systematic sampling is choosing what is called a jump size (often denoted \(m\)). The jump size is found by dividing the population size, \(N\), by the desired sample size, \(n\). That is, \(m \approx N/n\). The next step is to pick a random starting point between 1 and \(m\), and then take every \(m\)-th observation after it. The systematic sampling plan is a valid random sampling plan because the starting point is chosen at random.

For example, let's say this time we have a population of 50 people in a room and we want to take a systematic sample of 10 people. Our jump size (\(m\)) would be \(50/10 = 5\). Then, we would take a random number from 1 to 5 as our starting point (2, for example). Next, we would take every 5th observation afterwards. So, persons 2, 7, 12, 17, and so on would be chosen until we have a sample of 10 people.
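The same example can be sketched in R. This is only an illustration of the procedure described above; the seed is arbitrary.

```r
set.seed(123)                 # arbitrary seed, for reproducibility only

N <- 50                       # population size
n <- 10                       # desired sample size
m <- N %/% n                  # jump size: 5

start <- sample(m, 1)         # random starting point in 1..m
sys_ids <- seq(from = start, by = m, length.out = n)

# With a starting point of 2, the selected persons would be:
example_ids <- seq(from = 2, by = m, length.out = n)
example_ids                   # 2 7 12 17 22 27 32 37 42 47
```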

2.3 Stratified Sampling

Stratified sampling acts as an alternative to SRS when an SRS may be too difficult to obtain. Stratified sampling splits the population into groups via a stratification variable, and a random sample is then taken from each group. One important point about stratified sampling as used here is that the acquired sample must be proportional to the given population.

For example, let's say our population is a room full of 100 people, and our stratification variable is sex. There are 40 females and 60 males in the room. So, we would separate the population into two subpopulations: one for males and another for females. Then, we would take an SRS from each subpopulation in such a way that the ratio between males and females in the sample remains 60:40.
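A minimal sketch of this example in R is shown below. The total sample size of 10 is a hypothetical choice made only for the illustration (the example above does not fix one), and the seed is arbitrary.

```r
set.seed(123)                                   # arbitrary seed

# The room: 40 females and 60 males.
room <- data.frame(id  = 1:100,
                   sex = rep(c("F", "M"), times = c(40, 60)))

n <- 10                                         # hypothetical total sample size
strata <- split(room, room$sex)                 # one data frame per sex

# Proportional allocation: each stratum gets its population share of n.
alloc <- round(n * sapply(strata, nrow) / nrow(room))   # F = 4, M = 6

# Take an SRS of the allocated size within each stratum, then stack them.
strat <- do.call(rbind, Map(function(d, k) d[sample(nrow(d), k), ],
                            strata, alloc))

table(strat$sex)                                # 4 females, 6 males
```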

3 Stratification Variable and Study Population

I need to define a stratification variable for stratified sampling. In order to do this, I must ensure that each category of the stratification variable has enough observations to be sampled.

3.1 Stratification Variable

In my analysis, I modify the Approval Fiscal Year (ApprovalFY) to define a stratification variable for stratified sampling.

ApprovalFY is a 4-digit number representing the year in which the loan was originally initiated, ranging from 1962 through 2014. Below, I explore the frequency distribution of the 4-digit approval year to decide which categories to combine.

  • There are two categories that do not have any observations: 1963 and 1964.

  • Several categories (namely those before 1990) have very few observations.

For both of these reasons, I combine the years 1962 through 1989 into one category (named 1989) to ensure there are enough observations for proper sampling. I create a string variable strApprovalFY to represent the modified years.
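The recoding can be sketched as follows. This is a minimal illustration, assuming ApprovalFY is stored as a number; the toy vector below is a hypothetical stand-in for the real column.

```r
# Hypothetical stand-in for the ApprovalFY column of the loan data.
ApprovalFY <- c(1962, 1965, 1978, 1989, 1990, 1990, 2005, 2014)

table(ApprovalFY)   # frequency distribution of the original years

# Collapse all years up to and including 1989 into the single category "1989";
# later years keep their own label, stored as a string.
strApprovalFY <- ifelse(ApprovalFY <= 1989, "1989", as.character(ApprovalFY))

table(strApprovalFY)   # frequency distribution of the modified years
```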

3.2 Study Population

Based on the above frequency distribution of the approval year, I use the following rule to define the study population: the small-size categories 1962, 1965, 1966, and 1977 through 1989 are combined into a single category.

The study population has 899,164 small businesses across 26 years, with 29 variables including some derived variables for sampling purposes.

3.3 Loan Default Rates by Year: Study Population

I now compute the loan default rate for each year, where the year is given by the stratification variable strApprovalFY. The loan default status is defined by the variable MIS_Status.
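A sketch of this calculation is given below. It assumes that a loan counts as a default when MIS_Status equals "CHGOFF" (charged off); this coding is an assumption made for the illustration and should be checked against the data description. The toy data frame is hypothetical.

```r
# Hypothetical toy version of the study population.
loan <- data.frame(
  strApprovalFY = c("1989", "1989", "1990", "1990", "1990", "2005"),
  MIS_Status    = c("CHGOFF", "P I F", "P I F", "P I F", "CHGOFF", "P I F")
)

# Assumption: "CHGOFF" marks a defaulted loan.
is_default <- loan$MIS_Status == "CHGOFF"

# Default rate per year = proportion of defaulted loans within each year.
pop_rates <- tapply(is_default, loan$strApprovalFY, mean)

round(100 * pop_rates, 1)   # default rates in percent
```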

4 Drawing Random Samples

I implement three different sampling plans, one for each type of sampling reviewed above. Each plan targets a sample of 4000 observations.

I use the base R function sample() to draw the samples. To do so, I number the observations from 1 to 899164 so that each small business in the study population has a unique ID. The three samples are then drawn based on these IDs using sample().

Simple Random Sampling

I draw random IDs and then identify the records that match the sampled IDs to obtain the SRS sample.
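The step above can be sketched as follows. This is a minimal illustration in which the population is represented by a hypothetical data frame holding only the ID column; the seed is arbitrary.

```r
set.seed(123)                     # arbitrary seed, for reproducibility only

N <- 899164                       # study population size
n <- 4000                         # sample size

# Hypothetical stand-in for the study population: only the ID column.
pop <- data.frame(id = 1:N)

srs_id <- sample(N, size = n)     # n distinct IDs, drawn without replacement

# Keep the records whose ID was sampled.
srs_sample <- pop[pop$id %in% srs_id, , drop = FALSE]

nrow(srs_sample)                  # 4000
```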

Systematic Sampling

The jump size is calculated as \(m = 899164/4000 = 224.791\), which I round down to an actual jump size of 224. I use the sample() function to select a random record from the first 224 records and then take every 224th record after it. Because the original \(m\) value is not an integer, the actual systematic sample size might not be exactly 4000.
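A sketch of this plan on the ID numbers alone is shown below. It assumes that every 224th record is taken all the way to the end of the population; under that assumption the sample contains 4014 or 4015 records, depending on the random starting point. The seed is arbitrary.

```r
set.seed(123)                 # arbitrary seed, for reproducibility only

N <- 899164                   # study population size
n <- 4000                     # target sample size
m <- floor(N / n)             # actual jump size: 224

start <- sample(m, 1)         # random record among the first 224

# Every 224th ID, starting at the random start and running to the end.
sys_id <- seq(from = start, to = N, by = m)

length(sys_id)                # 4014 or 4015, not exactly 4000
```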

Stratified Sampling

I take an SRS from each stratum defined by strApprovalFY. The sample size within each stratum is approximately proportional to the size of that stratum.
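Proportional allocation across strata can be sketched as follows. The toy population, its three strata, and the total sample size of 400 are hypothetical and serve only to illustrate the idea; the seed is arbitrary.

```r
set.seed(123)                                  # arbitrary seed

# Hypothetical toy population with three strata of unequal size.
pop <- data.frame(
  id            = 1:10000,
  strApprovalFY = rep(c("1989", "1990", "1991"), times = c(2000, 3000, 5000))
)
n <- 400                                       # hypothetical total sample size

strata <- split(pop, pop$strApprovalFY)        # one data frame per stratum

# Stratum sample sizes proportional to stratum sizes: 80, 120, 200.
# Rounding means the total can differ slightly from n in general.
sizes <- round(n * sapply(strata, nrow) / nrow(pop))

# SRS of the allocated size within each stratum, then stack the pieces.
strat_sample <- do.call(rbind, Map(function(d, k) d[sample(nrow(d), k), ],
                                   strata, sizes))

table(strat_sample$strApprovalFY)
```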

5 Performance Analysis of Random Samples

In this section, I compare the three sampling plans with one another. The metric I use is the default rate in each year (the year was also used as the stratification variable).

The table summarizing the population-level and sample-level default rates is below:

First, note that the sample default rates in the above table are themselves random. Thus, the following observations are based solely on these particular samples and might not apply to other random samples.

  • The sample default rates for some years vary between the sampling methods.

  • For each sampling method, the sample default rates are close to the population rates in some years and far from them in others.

  • There are some years that are not represented in the samples due to randomness.

The patterns of year-specific default rates are also reflected in the above line plot.

To see the overall performance of the three sampling plans based on the single-step samples, I look at the mean squared error (MSE) of the differences between the population and sample default rates for each of the three random samples. The results are summarized in the graph above, which shows that stratified sampling and SRS clearly outperform the systematic sampling plan for these samples.

Important Note: The pattern observed above in the discrepancy between population and sample rates could change substantially from one set of samples to another.
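The MSE calculation can be sketched as below. All rates in this snippet are made-up numbers used purely to show the arithmetic; they are not results from this report. Years with no sampled records are entered as NA and skipped.

```r
# Hypothetical year-specific default rates (illustrative values only).
pop_rate <- c("1989" = 0.20, "1990" = 0.15, "1991" = 0.10)
srs_rate <- c("1989" = 0.22, "1990" = 0.14, "1991" = 0.11)
sys_rate <- c("1989" = NA,   "1990" = 0.19, "1991" = 0.07)  # 1989 not sampled

# Mean squared difference between sample and population rates,
# ignoring years that are missing from the sample.
mse <- function(sample_rate, pop_rate) {
  mean((sample_rate - pop_rate)^2, na.rm = TRUE)
}

mse_srs <- mse(srs_rate, pop_rate)   # (0.02^2 + 0.01^2 + 0.01^2) / 3
mse_sys <- mse(sys_rate, pop_rate)   # (0.04^2 + 0.03^2) / 2
```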

6 Discussions and Closing Remarks

I implemented three well-known sampling plans that are commonly used in practice on a large SBA bank loan data set. The approval fiscal year was used to define the study population and the stratification variable for stratified sampling. The difference between population-level year-specific default rates and sample-level rates was used to compare the performance of the sampling plans.

The comparison results are based on a single sample from each plan, so they could differ noticeably for another set of samples. A more reliable way to obtain a stable measure of the overall performance of the three sampling plans is to take multiple samples and compare the mean MSEs in the same way as above.
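The repeated-sampling idea can be sketched as follows. Everything here is an assumption made for illustration: the toy population, its 20 equally sized years, the default probabilities, the sample size of 400, and the 200 repetitions. The output therefore says nothing about the SBA data; the sketch only shows the mechanics.

```r
set.seed(123)                                   # arbitrary seed

# Hypothetical toy population: 20 years with 500 records each.
N <- 10000
pop <- data.frame(id = 1:N, year = rep(sprintf("Y%02d", 1:20), each = 500))
pop$default <- rbinom(N, 1, prob = 0.1 + 0.01 * as.integer(factor(pop$year)))
pop_rate <- tapply(pop$default, pop$year, mean)
n <- 400                                        # hypothetical sample size

# MSE between year-specific sample rates and population rates.
year_mse <- function(ids) {
  s <- pop[ids, ]
  rate <- tapply(s$default, factor(s$year, levels = names(pop_rate)), mean)
  mean((rate - pop_rate)^2, na.rm = TRUE)       # unsampled years are skipped
}

# Draw one sample per plan and return the three MSEs.
one_run <- function() {
  srs_ids <- sample(N, n)
  m <- N %/% n
  sys_ids <- seq(from = sample(m, 1), to = N, by = m)
  sizes <- round(n * as.vector(table(pop$year)) / N)
  str_ids <- unlist(Map(function(ids, k) ids[sample(length(ids), k)],
                        split(pop$id, pop$year), sizes), use.names = FALSE)
  c(SRS = year_mse(srs_ids),
    Systematic = year_mse(sys_ids),
    Stratified = year_mse(str_ids))
}

runs <- replicate(200, one_run())               # 3 x 200 matrix of MSEs
mean_mse <- rowMeans(runs)                      # mean MSE per sampling plan
mean_mse
```

Comparing the entries of mean_mse, rather than the MSEs from a single draw, averages out the sample-to-sample variation noted above.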