1 Introduction

We drew three kinds of random samples and compared their performance: simple random sampling, systematic random sampling, and stratified random sampling.

We used a U.S. Small Business Administration (SBA) dataset for our analysis, which covers loans from 1987 through 2014. The dataset contains 899,164 observations and 27 variables, with each observation representing a loan guaranteed to some degree by the SBA.

The full SBA dataset was split into 9 subsets that are stored in Professor Peng’s repository on GitHub. We loaded the subsets into R and merged them into a single dataset.

The goal of the analysis is to compare the performance of the three random sampling plans, using loan default rates as the reference metric. The Franchise Code variable was used to divide the study population into separate sub-populations. The effectiveness of each sampling plan is judged by how closely the default rates in its sample match the default rates of these sub-populations. The rates used for the comparison are displayed in graphs.

Before we delve into the analysis, we first discuss the three sampling plans and then perform data management tasks to define the study population. Once the three random samples are drawn from the study population, the comparison is presented using graphs. A discussion of the results and other remarks follows.

2 Review of Three Different Random Sampling Plans

Only random samples based on probability sampling support valid statistical inference. In this analysis, we use three types of random sampling plans: simple random sampling (SRS), systematic sampling, and stratified sampling.

2.1 Simple Random Sampling

Simple random sampling (SRS) is the most commonly used sampling plan in statistical analysis. Every possible sample of size \(n\) drawn from the population is equally likely to be selected.

The image illustrates the idea of simple random sampling.
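The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis; the function name `simple_random_sample` and the 20-record population are hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n items without replacement.

    random.Random.sample gives every subset of size n the same
    chance of being selected, which is the defining SRS property.
    """
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical population of 20 record IDs (not taken from the SBA data).
population = list(range(1, 21))
srs = simple_random_sample(population, 5, seed=1)
print(sorted(srs))
```

Fixing the seed makes the draw reproducible, which helps when the same sample has to be compared against other sampling plans.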


2.2 Systematic Random Sampling

Systematic random sampling requires a jump size, calculated as \(m \approx N/n\), where \(N\) is the population size and \(n\) is the size of the random sample to be drawn from that population. The first observation is chosen at random from the first \(m\) records, and every \(m\)-th observation after it is added to the sample.

The image illustrates the idea of systematic random sampling.

In this example, the jump size is 3 and the randomly chosen first subject is the second one in the population. The systematic sample is then formed by taking every third subject after it.
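The same example can be written as a short Python sketch. The function name `systematic_sample` and the 12-subject population are assumptions made for illustration only.

```python
import random

def systematic_sample(population, jump, start=None, seed=None):
    """Take every jump-th item after a starting position.

    If no start is given, one is chosen at random among the
    first `jump` positions (0-based index in [0, jump)).
    """
    if start is None:
        start = random.Random(seed).randrange(jump)
    return population[start::jump]

# Hypothetical population of 12 subjects numbered 1..12.
population = list(range(1, 13))

# Jump size 3, starting at the second subject (0-based index 1).
sample = systematic_sample(population, jump=3, start=1)
print(sample)  # [2, 5, 8, 11]
```

Only the starting position is random, so a systematic sample is fully determined once the first subject is chosen.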


2.3 Stratified Random Sampling

When simple random samples are too challenging to obtain, stratified random sampling can be used. A stratification variable is used to split the population into separate strata. An SRS is taken from each stratum, and these sub-samples are combined into one sample. The size of the sub-sample taken from each stratum must be proportional to the size of the corresponding sub-population, so that the combined sample is comparable to the other sample types.
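A minimal Python sketch of stratified sampling with proportional allocation follows. The function name `stratified_sample`, the `stratum_of` callback, and the three-stratum toy population are hypothetical; rounding each stratum's share is one possible allocation rule and is assumed here.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, n, seed=None):
    """Draw an SRS from each stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)

    # Group the records by their stratum label.
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)

    total = len(records)
    sample = []
    for members in strata.values():
        # Proportional allocation (assumed rule: round the exact share).
        k = round(n * len(members) / total)
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Hypothetical population: strata A, B, C with 60, 30, and 10 records.
records = ([("A", i) for i in range(60)]
           + [("B", i) for i in range(30)]
           + [("C", i) for i in range(10)])
sample = stratified_sample(records, stratum_of=lambda r: r[0], n=10, seed=1)
```

Because each stratum size is rounded separately, the combined sample size can differ slightly from the target `n` when the shares are not whole numbers.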

3 The Stratification Variable and Study Population

First, a stratification variable must be defined for stratified sampling. Each category of this newly created categorical variable should have enough observations to be sampled. Therefore, it is better to exclude small categories and categories whose observations have no value. The final stratification variable also defines the study population for the analysis.

3.1 Stratification Variable

A stratification variable can be created by discretizing a numerical variable, or by using or modifying an existing categorical variable. For this analysis, we modified the Franchise Code variable to create the stratification variable for the stratified random sampling.

The Franchise Code is a 5-digit code. We use the first two digits of the code as a basis for defining the stratification variable.

Then, we explore the frequency distribution of the 2-digit franchise codes and find categories with a small size.

0 1 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 3 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 94 95 96 97 98 99
208835 638554 2548 55 15 328 91 746 667 1571 320 434 779 1827 776 314 827 908 567 91 231 538 12 300 288 177 305 978 1098 512 244 849 546 231 408 388 480 328 694 370 236 537 407 1707 283 1374 700 411 476 682 246 203 218 359 710 336 389 679 927 736 1010 2688 273 577 257 489 735 377 795 276 493 3658 1247 632 858 268 338 745 191 82 53 312 626 607 297 85 68 63 18 48 151 1
  • 847389 businesses do not have a franchise code (recorded as 0 or 1), which limits the population that can be used. Even so, the number of observations with a franchise code is large enough to support an appropriate sample size. Since the 2-digit franchise code is used to stratify the population, this variable is carried into the study population.

  • Several categories (11, 12, 14, 27, 3, 86, 87, 92, 94, 95, 96, 97, 98, 99) have relatively small sizes.

  • The 2-digit code must be changed to define the final stratification variable for the stratified sampling.
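The two-digit codes tabulated in this section could be derived along the following lines. This is a sketch under assumptions: the helper `two_digit_code` is hypothetical, it takes the leading characters of the code as text (which would explain why short codes such as 0, 1, and 3 appear unchanged in the table), and the example codes are made up.

```python
from collections import Counter

def two_digit_code(franchise_code):
    """Return the first two characters of the code; shorter codes stay as-is."""
    return str(franchise_code)[:2]

# Made-up franchise codes for illustration (not taken from the SBA data).
codes = [0, 1, 1, 10656, 10656, 78760, 3, 68020]

# Frequency distribution of the two-digit codes.
freq = Counter(two_digit_code(c) for c in codes)
for code, count in sorted(freq.items()):
    print(code, count)
```

Sorting the codes as text reproduces the ordering seen in the frequency table, where 3 falls between 29 and 30.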

3.2 Study Population

Based on the frequency distribution of the 2-digit franchise codes, the following rule was implemented to define the study population: exclude unclassified businesses with franchise codes 0 or 1, and exclude the small-size categories listed above.

10 13 15 16 17 18 19 20 21 22 23 24 25 26 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 88 89 90 91
2548 328 746 667 1571 320 434 779 1827 776 314 827 908 567 231 538 300 288 177 305 978 1098 512 244 849 546 231 408 388 480 328 694 370 236 537 407 1707 283 1374 700 411 476 682 246 203 218 359 710 336 389 679 927 736 1010 2688 273 577 257 489 735 377 795 276 493 3658 1247 632 858 268 338 745 191 312 626 607 297

The study population has 50942 small businesses across 76 different franchise categories with 29 variables including some derived variables for sampling purposes.
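The size of the study population can be checked against the frequency table in Section 3.1. The sketch below copies the relevant counts from that table; the variable names and the helper `in_study_population` are hypothetical.

```python
# Counts copied from the 2-digit franchise code frequency table.
unclassified = {"0": 208835, "1": 638554}
small = {"11": 55, "12": 15, "14": 91, "27": 91, "3": 12, "86": 82, "87": 53,
         "92": 85, "94": 68, "95": 63, "96": 18, "97": 48, "98": 151, "99": 1}

total_observations = 899164
excluded_codes = set(unclassified) | set(small)

def in_study_population(code2):
    """Keep a record only if its 2-digit code is not an excluded category."""
    return code2 not in excluded_codes

study_size = (total_observations
              - sum(unclassified.values())
              - sum(small.values()))
print(study_size)  # 50942
```

The result agrees with the stated study population size, and 92 original categories minus the 16 excluded ones leaves the 76 categories reported above.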


3.3 Loan Default Rates by Franchise Code: Study Population

We now compute the loan default rates by franchise code, as defined by the stratification variable strFranchiseCode. Loan default status is defined using the variable MIS_Status.

code no.lab default no.default default.rate
0 795 71175 136865 34.2
1 1159 78523 558872 12.3
10 0 402 2146 15.8
11 0 5 50 9.1
12 0 3 12 20.0
13 0 73 255 22.3
14 0 11 80 12.1
15 1 126 619 16.9
16 0 67 600 10.0
17 0 332 1239 21.1
18 0 48 272 15.0
19 0 67 367 15.4
20 0 184 595 23.6
21 3 169 1655 9.3
22 0 59 717 7.6
23 0 48 266 15.3
24 3 121 703 14.7
25 0 92 816 10.1
26 0 74 493 13.1
27 0 21 70 23.1
28 0 74 157 32.0
29 0 90 448 16.7
3 0 3 9 25.0
30 1 53 246 17.7
31 0 70 218 24.3
32 0 40 137 22.6
33 0 54 251 17.7
34 0 209 769 21.4
35 0 164 934 14.9
36 2 46 464 9.0
37 0 28 216 11.5
38 0 61 788 7.2
39 0 92 454 16.8
40 0 46 185 19.9
41 0 54 354 13.2
42 3 39 346 10.1
43 0 58 422 12.1
44 0 82 246 25.0
45 0 91 603 13.1
46 0 69 301 18.6
47 0 46 190 19.5
48 1 68 468 12.7
49 0 67 340 16.5
50 0 221 1486 12.9
51 20 31 232 11.8
52 0 303 1071 22.1
53 0 112 588 16.0
54 0 61 350 14.8
55 0 81 395 17.0
56 1 98 583 14.4
57 0 23 223 9.3
58 0 67 136 33.0
59 0 55 163 25.2
60 0 57 302 15.9
61 0 95 615 13.4
62 0 45 291 13.4
63 0 101 288 26.0
64 1 161 517 23.7
65 1 197 729 21.3
66 0 118 618 16.0
67 0 66 944 6.5
68 1 609 2078 22.7
69 0 41 232 15.0
70 0 104 473 18.0
71 0 55 202 21.4
72 0 78 411 16.0
73 1 75 659 10.2
74 0 54 323 14.3
75 1 115 679 14.5
76 0 39 237 14.1
77 0 59 434 12.0
78 0 209 3449 5.7
79 2 140 1105 11.2
80 0 123 509 19.5
81 0 147 711 17.1
82 0 35 233 13.1
83 0 43 295 12.7
84 0 91 654 12.2
85 0 24 167 12.6
86 0 15 67 18.3
87 0 8 45 15.1
88 0 36 276 11.5
89 0 154 472 24.6
90 1 86 520 14.2
91 0 44 253 14.8
92 0 13 72 15.3
94 0 35 33 51.5
95 0 6 57 9.5
96 0 1 17 5.6
97 0 10 38 20.8
98 0 13 138 8.6
99 0 0 1 0.0
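The default.rate column appears to be the percentage of defaults among loans with a known status, that is, with the no.lab loans left out of the denominator. That reading is an assumption, but it reproduces the tabulated values, as the sketch below shows for codes 10 and 51. The function name `default_rate` is hypothetical.

```python
def default_rate(n_default, n_no_default):
    """Percentage of labeled loans that defaulted, rounded to one decimal.

    Loans without a status label (no.lab) are assumed to be excluded.
    """
    return round(100 * n_default / (n_default + n_no_default), 1)

print(default_rate(402, 2146))  # code 10 -> 15.8
print(default_rate(31, 232))    # code 51 -> 11.8
```

For code 51, including the 20 unlabeled loans in the denominator would give 11.0 rather than the tabulated 11.8, which supports the assumption.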


4 Drawing Random Samples

Three types of random sampling plans are used in this analysis, each with a target sample size of 5000 observations.

For ease of comparison, we append the franchise code-specific default rates of each sample to the franchise code-specific default rates of the whole study population.

  • Simple Random Sampling

We draw observations at random and then select the records that correspond to the sampled observations. A sampling list was defined and added to the study population for this purpose.

  • Systematic sampling

The jump size is calculated as \(m = 50942/5000 \approx 10.2\), so the actual jump size used is 10. One observation is chosen at random from the first 10 records, and every 10th observation after it is added to the sample.

  • Stratified Sampling

A simple random sample was taken from each stratum, with a sample size approximately proportional to the size of the stratum. First, the sample size for each stratum is calculated. Then the SRS is drawn from the corresponding stratum (sub-population).
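The sample-size arithmetic for the systematic and stratified plans can be sketched as follows, using the study population size and stratum counts reported earlier. Rounding the jump size down and rounding each stratum share to the nearest integer are assumptions, as is the helper name `stratum_sample_size`.

```python
N = 50942   # study population size
n = 5000    # target sample size

# Systematic sampling: jump size rounded down to an integer (assumed rule).
jump = N // n
print(N / n, jump)

# Number of records picked when starting at the first record (index 0).
systematic_size = len(range(0, N, jump))
print(systematic_size)

def stratum_sample_size(stratum_size, population_size=N, target=n):
    """Proportional allocation, rounded to the nearest integer (assumed rule)."""
    return round(target * stratum_size / population_size)

# Stratum sizes taken from the study population frequency table.
print(stratum_sample_size(3658))  # franchise code 78
print(stratum_sample_size(177))   # franchise code 32
```

With a jump size of 10, a systematic pass over all 50942 records selects slightly more than 5000 of them, so the three samples are of similar but not necessarily identical size.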


5 Performance Analysis of Random Samples

A comparative analysis of the three random samples is carried out here. The metric we use is the default rate for each franchise code, defined by the first two digits of the franchise code, which also served as the stratification variable for the stratified sampling plan.

We have calculated the default rate across the franchise codes. Now, we will use the population-level franchise code-specific rates for reference and compare them to sample-level franchise code-specific default rates. The following table shows the population and sample level default rates for this analysis.

Comparison of franchise code-specific default rates between the population and the SRS, systematic, and stratified samples.
code default.rate.pop default.rate.srs default.rate.sys default.rate.str
10 15.8 15.4 15.5 20.8
13 22.3 29.7 8.9 18.8
15 16.9 15.0 13.6 20.5
16 10.0 11.5 8.1 1.5
17 21.1 18.4 20.1 25.3
18 15.0 17.9 20.0 16.1
19 15.4 13.6 13.0 14.0
20 23.6 17.3 25.6 18.4
21 9.3 8.6 6.4 8.4
22 7.6 5.9 6.5 9.2
23 15.3 11.1 23.7 16.1
24 14.7 19.5 14.1 17.3
25 10.1 10.3 16.9 11.2
26 13.1 9.8 10.9 19.6
28 32.0 26.3 27.8 26.1
29 16.7 12.9 9.4 17.0
30 17.7 26.7 18.8 13.8
31 24.3 21.4 15.4 25.0
32 22.6 29.4 7.7 11.8
33 17.7 25.0 7.4 10.0
34 21.4 18.1 25.5 27.1
35 14.9 16.7 15.0 14.8
36 9.0 8.2 10.0 6.0
37 11.5 13.0 18.5 4.2
38 7.2 7.9 2.6 3.6
39 16.8 17.2 15.1 18.5
40 19.9 14.3 0.0 21.7
41 13.2 8.5 6.8 10.0
42 10.1 4.2 9.1 10.5
43 12.1 13.2 8.9 14.9
44 25.0 23.1 29.2 15.6
45 13.1 8.4 9.5 13.2
46 18.6 17.1 16.7 13.9
47 19.5 17.2 14.3 17.4
48 12.7 18.3 17.1 5.7
49 16.5 6.8 23.7 10.0
50 12.9 9.0 15.0 16.7
51 11.8 7.4 10.3 10.7
52 22.1 20.0 18.0 22.2
53 16.0 21.8 8.5 11.6
54 14.8 18.9 6.5 12.5
55 17.0 10.4 14.3 27.7
56 14.4 8.8 17.5 9.0
57 9.3 2.9 10.0 16.7
58 33.0 41.2 16.7 20.0
59 25.2 28.6 23.5 23.8
60 15.9 19.2 11.4 14.3
61 13.4 5.7 11.1 8.6
62 13.4 9.5 14.7 9.1
63 26.0 20.0 25.0 21.1
64 23.7 22.8 19.2 22.4
65 21.3 15.9 22.6 19.8
66 16.0 17.9 16.1 18.1
67 6.5 8.6 4.4 9.1
68 22.7 23.9 18.5 26.1
69 15.0 10.0 17.9 22.2
70 18.0 17.2 14.8 12.3
71 21.4 20.7 26.7 24.0
72 16.0 15.5 21.4 20.8
73 10.2 13.6 6.2 12.7
74 14.3 10.0 24.2 10.8
75 14.5 14.6 13.9 24.4
76 14.1 21.4 19.4 11.1
77 12.0 20.4 14.5 12.5
78 5.7 3.2 4.8 6.1
79 11.2 9.9 14.0 13.1
80 19.5 15.4 26.7 14.5
81 17.1 15.7 15.2 11.9
82 13.1 8.3 20.6 7.7
83 12.7 17.9 12.8 9.1
84 12.2 10.0 14.3 11.0
85 12.6 14.3 18.8 21.1
88 11.5 16.1 11.1 12.9
89 24.6 29.7 28.8 27.9
90 14.2 9.7 10.0 6.7
91 14.8 15.0 7.4 6.9

The sample default rates in this table depend on the particular random samples that were drawn. Therefore, the following observations apply to this table only.

  • For some franchise codes, the sample default rate deviates substantially from the true default rate at the population level.

  • Apart from these cases, the sample default rates are close to the population default rates. Therefore, we do not test the significance of the differences between the population and sample default rates.

The above patterns of franchise code-specific default rates are shown in the graphs above.

The overall performance of the three random sampling plans, based on the sample drawn under each, is summarized in the first graph. The second graph summarizes the mean squared error (MSE) of the differences in default rates between each of the three random samples and the population. The simple random sampling (SRS) plan appears to perform better than the other two plans. On this evidence, simple random sampling is the preferred plan for the analysis of the SBA data.
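One way to compute such an MSE is to average the squared differences between the population and sample default rates across franchise codes. The sketch below assumes that definition and uses only the first three rows of the comparison table (codes 10, 13, and 15) to show the arithmetic; the function name `mse` is hypothetical.

```python
def mse(reference, estimate):
    """Mean of squared differences between two equal-length rate lists."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)

# Default rates for franchise codes 10, 13, and 15 from the comparison table.
pop_rates = [15.8, 22.3, 16.9]
srs_rates = [15.4, 29.7, 15.0]
sys_rates = [15.5, 8.9, 13.6]
str_rates = [20.8, 18.8, 20.5]

for name, rates in [("SRS", srs_rates), ("systematic", sys_rates),
                    ("stratified", str_rates)]:
    print(name, round(mse(pop_rates, rates), 2))
```

Three rows are far too few to rank the plans; the full comparison uses all 76 franchise codes in the table.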

6 Discussion of Results and Conclusions

Three types of random sampling plans were compared in a performance analysis of the SBA bank loan dataset. The FranchiseCode variable was used to define both the stratification variable for the stratified sampling plan and the study population. The difference between the sample-level franchise code-specific default rates and the population-level rates was used to compare the performance of the three sampling plans. The simple random sampling (SRS) plan appeared to perform the best and is therefore the type of random sample recommended for this analysis.

The results of the comparative performance analysis, however, were based on a single sample from each plan, so they could vary considerably from one draw to the next. Taking multiple samples per sampling plan would probably have been a better method for this analysis. Also, many of the observations were not used in the analysis because they did not have a franchise code. The performance analysis of the three random sampling plans might have turned out differently if more of the observations had franchise codes.

7 Appendix

Comparison of franchise code-specific default rates between the population, the SRS, and the systematic sample.
code default.rate.pop default.rate.srs default.rate.sys
10 15.8 15.4 15.5
13 22.3 29.7 8.9
15 16.9 15.0 13.6
16 10.0 11.5 8.1
17 21.1 18.4 20.1
18 15.0 17.9 20.0
19 15.4 13.6 13.0
20 23.6 17.3 25.6
21 9.3 8.6 6.4
22 7.6 5.9 6.5
23 15.3 11.1 23.7
24 14.7 19.5 14.1
25 10.1 10.3 16.9
26 13.1 9.8 10.9
28 32.0 26.3 27.8
29 16.7 12.9 9.4
30 17.7 26.7 18.8
31 24.3 21.4 15.4
32 22.6 29.4 7.7
33 17.7 25.0 7.4
34 21.4 18.1 25.5
35 14.9 16.7 15.0
36 9.0 8.2 10.0
37 11.5 13.0 18.5
38 7.2 7.9 2.6
39 16.8 17.2 15.1
40 19.9 14.3 0.0
41 13.2 8.5 6.8
42 10.1 4.2 9.1
43 12.1 13.2 8.9
44 25.0 23.1 29.2
45 13.1 8.4 9.5
46 18.6 17.1 16.7
47 19.5 17.2 14.3
48 12.7 18.3 17.1
49 16.5 6.8 23.7
50 12.9 9.0 15.0
51 11.8 7.4 10.3
52 22.1 20.0 18.0
53 16.0 21.8 8.5
54 14.8 18.9 6.5
55 17.0 10.4 14.3
56 14.4 8.8 17.5
57 9.3 2.9 10.0
58 33.0 41.2 16.7
59 25.2 28.6 23.5
60 15.9 19.2 11.4
61 13.4 5.7 11.1
62 13.4 9.5 14.7
63 26.0 20.0 25.0
64 23.7 22.8 19.2
65 21.3 15.9 22.6
66 16.0 17.9 16.1
67 6.5 8.6 4.4
68 22.7 23.9 18.5
69 15.0 10.0 17.9
70 18.0 17.2 14.8
71 21.4 20.7 26.7
72 16.0 15.5 21.4
73 10.2 13.6 6.2
74 14.3 10.0 24.2
75 14.5 14.6 13.9
76 14.1 21.4 19.4
77 12.0 20.4 14.5
78 5.7 3.2 4.8
79 11.2 9.9 14.0
80 19.5 15.4 26.7
81 17.1 15.7 15.2
82 13.1 8.3 20.6
83 12.7 17.9 12.8
84 12.2 10.0 14.3
85 12.6 14.3 18.8
88 11.5 16.1 11.1
89 24.6 29.7 28.8
90 14.2 9.7 10.0
91 14.8 15.0 7.4

Before creating the table comparing default rates for all three samples, a table excluding the stratified random sample was made to compare the other two samples first. In that comparison, the simple random sampling plan performed much better than the systematic random sampling plan.