1 Introduction

This analysis draws three kinds of random samples and compares their performance: simple random sampling, systematic random sampling, and stratified random sampling.

We used a U.S. Small Business Administration (SBA) dataset for the analysis, covering loans from 1987 through 2014. The dataset contains 899,164 observations and 27 variables, with each observation representing a loan guaranteed to some degree by the SBA.

The full SBA dataset was split into 9 subsets that are stored in Professor Peng’s repository on GitHub. We loaded the subsets into R and merged them into a single dataset.

The goal of the analysis is to compare the performance of the three random sampling plans, using loan default rates as the reference metric. The Franchise Code variable was used to divide the study population into separate sub-populations. The effectiveness of each sampling plan is judged by how far the sample default rates within these sub-populations differ from the population default rates. The rates used for the comparison are displayed in graphs.

Before we delve into the analysis, we first discuss the three sampling plans and then perform data management tasks to define the study population. Once the three random samples are drawn from the study population, the comparison is presented using graphs. A discussion of the results and other remarks follows.

2 Review of Three Different Random Sampling Plans

Only random samples obtained by probability sampling support valid statistical inference. In this analysis, we use three types of random sampling plans: simple random sampling (SRS), systematic sampling, and stratified sampling.

2.1 Simple Random Sampling

Simple random sampling (SRS) is the most commonly used sampling plan. Every possible sample of size \(n\) drawn from the population is equally likely to be selected for the analysis.

The image illustrates the idea of simple random samples.
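As a small worked illustration of "equally likely" (the numbers are hypothetical and not taken from the SBA data), consider a population of \(N = 5\) units and a sample size of \(n = 2\). The number of possible samples, and the probability of drawing any one of them under SRS, are:

\[
\binom{N}{n} = \binom{5}{2} = 10, \qquad P(\text{any particular sample}) = \frac{1}{\binom{N}{n}} = \frac{1}{10}.
\]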

2.2 Systematic Random Sampling

Systematic random sampling requires a jump size, calculated as \(m \approx N/n\), where \(N\) is the population size and \(n\) is the size of the sample drawn from that population. After the first observation is chosen at random from the first \(m\) records, every \(m\)-th observation after it is added to the sample.

The image illustrates the idea of systematic random samples.

In this example the jump size is 3 and the randomly chosen first subject is the second one in the population; the systematic sample is then formed by taking every third subject after it.
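The example can be reproduced with a minimal Python sketch. The analysis itself was run in R and its code is not shown here, so this is only an illustration; the function name `systematic_sample` and the 12-subject population are hypothetical.

```python
def systematic_sample(population, jump, start):
    """Return every `jump`-th element, beginning at 0-based index `start`."""
    if not 0 <= start < jump:
        raise ValueError("start must fall within the first `jump` positions")
    return population[start::jump]


# Hypothetical population of 12 subjects labeled 1..12.
subjects = list(range(1, 13))

# Jump size 3; start=1 is the second subject (0-based indexing).
sample = systematic_sample(subjects, jump=3, start=1)
print(sample)  # [2, 5, 8, 11]
```

In a real draw the starting position would be chosen at random, for example with `random.randrange(jump)`, rather than fixed at the second subject.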

2.3 Stratified Random Sampling

When simple random samples are too challenging to obtain, stratified random sampling can be used. A stratification variable splits the population into separate strata. An SRS is taken from each stratum, and these sub-samples are combined into one sample. The size of the sub-sample taken from each stratum must be proportional to the size of the corresponding sub-population, so that the combined sample is comparable to the other sample types.
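A minimal Python sketch of proportional stratified sampling follows. It is an illustration under assumptions, not the R code used for the analysis: the function `stratified_sample`, the toy strata A, B, and C, and the use of `round` for the per-stratum sizes are all hypothetical choices.

```python
import random


def stratified_sample(records, stratum_of, total_n, rng):
    """Draw an SRS from each stratum, sized in proportion to the stratum."""
    # Group the records by their stratum label.
    strata = {}
    for rec in records:
        strata.setdefault(stratum_of(rec), []).append(rec)

    population_size = len(records)
    combined = []
    for members in strata.values():
        # Proportional allocation (rounding is an assumption of this sketch).
        n_h = round(total_n * len(members) / population_size)
        combined.extend(rng.sample(members, n_h))  # SRS without replacement
    return combined


# Toy population: strata A, B, C with 60, 30, and 10 records.
records = (
    [("A", i) for i in range(60)]
    + [("B", i) for i in range(30)]
    + [("C", i) for i in range(10)]
)
rng = random.Random(2022)  # arbitrary seed for reproducibility
sample = stratified_sample(records, lambda r: r[0], total_n=20, rng=rng)
counts = {s: sum(1 for r in sample if r[0] == s) for s in "ABC"}
print(counts)  # {'A': 12, 'B': 6, 'C': 2}
```

The per-stratum counts mirror the 60/30/10 split of the population, which is what keeps the combined sample comparable to an SRS of the same total size.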

3 The Stratification Variable and Study Population

First, a stratification variable must be defined for stratified sampling. Each category of this newly created categorical variable must have enough observations to be sampled, so it is better to exclude small categories and categories whose observations have no value. The final stratification variable also defines the study population for the analysis.

3.1 Stratification Variable

A stratification variable can be created by discretizing a numerical variable, or using or modifying an existing categorical variable. For this analysis, we modified the Franchise Code variable to designate a stratification variable for the stratified random sampling.

The Franchise Code is a 5-digit code. We use the first two digits of the code as a basis for defining the stratification variable.

Then, we explore the frequency distribution of the 2-digit franchise codes and find categories with a small size.
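The two steps above (take the first two digits, then tabulate) can be sketched in Python. This is an illustration only: the helper `two_digit_code`, the sample code values, and the assumption that codes are stored as integers are all hypothetical.

```python
from collections import Counter


def two_digit_code(franchise_code):
    """Return the first two characters of the code as a string label."""
    return str(franchise_code)[:2]


# Hypothetical franchise codes; 0 and 1 stand for "no franchise".
codes = [0, 1, 1, 3, 10350, 10656, 78760, 78001]

freq = Counter(two_digit_code(c) for c in codes)
print(sorted(freq.items()))
# [('0', 1), ('1', 2), ('10', 2), ('3', 1), ('78', 2)]
```

Because the labels are strings, they sort as text, which is why a single-digit label such as 3 appears between 29 and 30 in the frequency table.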

0	1	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	3	30	31	32	33	34	35	36	37	38	39	40	41	42	43	44	45	46	47	48	49	50	51	52	53	54	55	56	57	58	59	60	61	62	63	64	65	66	67	68	69	70	71	72	73	74	75	76	77	78	79	80	81	82	83	84	85	86	87	88	89	90	91	92	94	95	96	97	98	99
208835	638554	2548	55	15	328	91	746	667	1571	320	434	779	1827	776	314	827	908	567	91	231	538	12	300	288	177	305	978	1098	512	244	849	546	231	408	388	480	328	694	370	236	537	407	1707	283	1374	700	411	476	682	246	203	218	359	710	336	389	679	927	736	1010	2688	273	577	257	489	735	377	795	276	493	3658	1247	632	858	268	338	745	191	82	53	312	626	607	297	85	68	63	18	48	151	1

847389 businesses do not have a franchise code (recorded as 0 or 1), which limits the population that can be used. Even so, the number of observations that do have a franchise code is large enough to support an appropriate sample size. Since the 2-digit franchise code is used to stratify the population, this variable is included in the study population.
Several categories (11, 12, 14, 27, 3, 86, 87, 92, 94, 95, 96, 97, 98, 99) have relatively small sizes (at most 151 observations each).
The 2-digit code must therefore be modified to define the final stratification variable for the stratified sampling.

3.2 Study Population

Using the frequency distribution of the 2-digit franchise codes, the following rule was applied to define the study population: exclude unclassified businesses (franchise codes 0 or 1) and exclude the small-size categories listed above.

10	13	15	16	17	18	19	20	21	22	23	24	25	26	28	29	30	31	32	33	34	35	36	37	38	39	40	41	42	43	44	45	46	47	48	49	50	51	52	53	54	55	56	57	58	59	60	61	62	63	64	65	66	67	68	69	70	71	72	73	74	75	76	77	78	79	80	81	82	83	84	85	88	89	90	91
2548	328	746	667	1571	320	434	779	1827	776	314	827	908	567	231	538	300	288	177	305	978	1098	512	244	849	546	231	408	388	480	328	694	370	236	537	407	1707	283	1374	700	411	476	682	246	203	218	359	710	336	389	679	927	736	1010	2688	273	577	257	489	735	377	795	276	493	3658	1247	632	858	268	338	745	191	312	626	607	297

The study population has 50942 small businesses across 76 different franchise categories with 29 variables including some derived variables for sampling purposes.
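The study population size can be checked against the frequency table with a short Python sketch. The counts are copied from the table of 2-digit codes; the variable names are illustrative only.

```python
TOTAL_OBSERVATIONS = 899_164

# Unclassified businesses (no franchise code).
unclassified = {"0": 208_835, "1": 638_554}

# Small categories excluded from the study population.
small = {
    "11": 55, "12": 15, "14": 91, "27": 91, "3": 12,
    "86": 82, "87": 53, "92": 85, "94": 68, "95": 63,
    "96": 18, "97": 48, "98": 151, "99": 1,
}

excluded = sum(unclassified.values()) + sum(small.values())
study_size = TOTAL_OBSERVATIONS - excluded
print(study_size)  # 50942
```

The result agrees with the reported 50942, and removing 16 of the 92 categories leaves the reported 76.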

3.3 Loan Default Rates by Franchise Code: Study Population

We now find the loan default rates by franchise code, as defined by the stratification variable strFranchiseCode. Loan default status is taken from the variable MIS_Status.

	no.lab	default	no.default	default.rate
0	795	71175	136865	34.2
1	1159	78523	558872	12.3
10	0	402	2146	15.8
11	0	5	50	9.1
12	0	3	12	20.0
13	0	73	255	22.3
14	0	11	80	12.1
15	1	126	619	16.9
16	0	67	600	10.0
17	0	332	1239	21.1
18	0	48	272	15.0
19	0	67	367	15.4
20	0	184	595	23.6
21	3	169	1655	9.3
22	0	59	717	7.6
23	0	48	266	15.3
24	3	121	703	14.7
25	0	92	816	10.1
26	0	74	493	13.1
27	0	21	70	23.1
28	0	74	157	32.0
29	0	90	448	16.7
3	0	3	9	25.0
30	1	53	246	17.7
31	0	70	218	24.3
32	0	40	137	22.6
33	0	54	251	17.7
34	0	209	769	21.4
35	0	164	934	14.9
36	2	46	464	9.0
37	0	28	216	11.5
38	0	61	788	7.2
39	0	92	454	16.8
40	0	46	185	19.9
41	0	54	354	13.2
42	3	39	346	10.1
43	0	58	422	12.1
44	0	82	246	25.0
45	0	91	603	13.1
46	0	69	301	18.6
47	0	46	190	19.5
48	1	68	468	12.7
49	0	67	340	16.5
50	0	221	1486	12.9
51	20	31	232	11.8
52	0	303	1071	22.1
53	0	112	588	16.0
54	0	61	350	14.8
55	0	81	395	17.0
56	1	98	583	14.4
57	0	23	223	9.3
58	0	67	136	33.0
59	0	55	163	25.2
60	0	57	302	15.9
61	0	95	615	13.4
62	0	45	291	13.4
63	0	101	288	26.0
64	1	161	517	23.7
65	1	197	729	21.3
66	0	118	618	16.0
67	0	66	944	6.5
68	1	609	2078	22.7
69	0	41	232	15.0
70	0	104	473	18.0
71	0	55	202	21.4
72	0	78	411	16.0
73	1	75	659	10.2
74	0	54	323	14.3
75	1	115	679	14.5
76	0	39	237	14.1
77	0	59	434	12.0
78	0	209	3449	5.7
79	2	140	1105	11.2
80	0	123	509	19.5
81	0	147	711	17.1
82	0	35	233	13.1
83	0	43	295	12.7
84	0	91	654	12.2
85	0	24	167	12.6
86	0	15	67	18.3
87	0	8	45	15.1
88	0	36	276	11.5
89	0	154	472	24.6
90	1	86	520	14.2
91	0	44	253	14.8
92	0	13	72	15.3
94	0	35	33	51.5
95	0	6	57	9.5
96	0	1	17	5.6
97	0	10	38	20.8
98	0	13	138	8.6
99	0	0	1	0.0
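The default.rate column appears to be the percentage of defaults among loans with a recorded status, leaving out the unlabeled loans counted in no.lab. The Python sketch below rests on that reading, which is inferred from the table rather than stated in the source; the function name `default_rate` is hypothetical.

```python
def default_rate(default, no_default):
    """Percent of defaults among labeled loans, rounded to one decimal."""
    return round(100 * default / (default + no_default), 1)


# Code 10: 402 defaults, 2146 non-defaults, 0 unlabeled.
print(default_rate(402, 2146))  # 15.8

# Code 51: 31 defaults, 232 non-defaults; the 20 unlabeled loans are left out.
print(default_rate(31, 232))  # 11.8
```

Both values match the table. Including the 20 unlabeled loans in the denominator for code 51 would give about 11.0 instead, which is why the labeled-only reading seems to be the one in use.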

4 Drawing Random Samples

Three types of random sampling plans are used in this analysis, each using a total of 5000 observations.

For ease of comparison, we append the franchise code-specific default rates of each sample to the franchise code-specific default rates of the whole study population.

Simple Random Sampling

We randomly select observation IDs and then pull the records that match the sampled IDs. To do this, a sampling list (an ID for each record) was defined and added to the study population.
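A minimal Python sketch of this step, assuming the sampling list is simply the record numbers 1 through 50942 and that the draw is made without replacement. The seed and variable names are arbitrary and hypothetical; the original R code is not shown in this report.

```python
import random

N_POPULATION = 50_942  # size of the study population
N_SAMPLE = 5_000       # sample size used for every plan

rng = random.Random(123)  # arbitrary seed so the draw can be repeated

# Sampling list: one ID per record in the study population.
sampling_ids = range(1, N_POPULATION + 1)

# SRS without replacement: every set of 5000 IDs is equally likely.
srs_ids = set(rng.sample(sampling_ids, N_SAMPLE))

# A record belongs to the sample when its ID was drawn.
print(len(srs_ids))  # 5000
```

The records themselves would then be selected by checking each record's ID for membership in `srs_ids`.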

Systematic Sampling

The jump size is calculated as \(m = 50942/5000 \approx 10.19\), so the actual jump size used is 10. One observation is chosen at random from the first 10 records, and every 10th observation after it is then added to the sample.
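The jump-size arithmetic and the resulting selection can be sketched in Python. This assumes the jump size is obtained by rounding 50942/5000 down and that the start is drawn uniformly from the first 10 records; the seed and variable names are hypothetical.

```python
import random

N_POPULATION = 50_942
N_SAMPLE = 5_000

jump = N_POPULATION // N_SAMPLE  # integer part of 50942 / 5000
print(jump)  # 10

rng = random.Random(123)      # arbitrary seed
start = rng.randrange(jump)   # 0-based position within the first 10 records

# Every 10th record from the random start to the end of the population.
systematic_idx = list(range(start, N_POPULATION, jump))
print(len(systematic_idx))
```

With a jump size of 10, one pass through all 50942 records selects 5094 or 5095 of them, depending on the start. How the original analysis reconciled this with the target of 5000 is not stated in the source.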

Stratified Sampling

A simple random sample was taken from each stratum, with a sample size approximately proportional to the size of the corresponding stratum. First, the SRS size for each stratum is calculated. Then the SRS is drawn from the corresponding stratum (sub-population).
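The per-stratum sample sizes can be illustrated with stratum counts taken from the study population table. Rounding to the nearest integer is an assumption of this Python sketch; the source says only that the sizes are approximately proportional.

```python
N_POPULATION = 50_942
N_SAMPLE = 5_000

# Stratum sizes for three franchise categories, from the study population table.
stratum_sizes = {"10": 2548, "32": 177, "78": 3658}

# Proportional allocation: n_h = n * N_h / N, rounded (assumed rounding rule).
allocation = {
    code: round(N_SAMPLE * size / N_POPULATION)
    for code, size in stratum_sizes.items()
}
print(allocation)  # {'10': 250, '32': 17, '78': 359}
```

Because each stratum is rounded separately, the allocated sizes summed over all 76 strata need not add up to exactly 5000.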

5 Performance Analysis of Random Samples

A comparative analysis of the three random samples is carried out here. The metric we use is the default rate for each franchise code category, defined by the first two digits of the franchise code. The same variable was used as the stratification variable for the stratified sampling plan.

We have calculated the default rates across the franchise codes. We now use the population-level franchise code-specific rates as the reference and compare them with the sample-level franchise code-specific default rates. The following table shows the population-level and sample-level default rates.

Comparison of franchise code-specific default rates between population, SRS, Systematic Sample, and Stratified Samples.
	default.rate.pop	default.rate.srs	default.rate.sys	default.rate.str
10	15.8	15.4	15.5	20.8
13	22.3	29.7	8.9	18.8
15	16.9	15.0	13.6	20.5
16	10.0	11.5	8.1	1.5
17	21.1	18.4	20.1	25.3
18	15.0	17.9	20.0	16.1
19	15.4	13.6	13.0	14.0
20	23.6	17.3	25.6	18.4
21	9.3	8.6	6.4	8.4
22	7.6	5.9	6.5	9.2
23	15.3	11.1	23.7	16.1
24	14.7	19.5	14.1	17.3
25	10.1	10.3	16.9	11.2
26	13.1	9.8	10.9	19.6
28	32.0	26.3	27.8	26.1
29	16.7	12.9	9.4	17.0
30	17.7	26.7	18.8	13.8
31	24.3	21.4	15.4	25.0
32	22.6	29.4	7.7	11.8
33	17.7	25.0	7.4	10.0
34	21.4	18.1	25.5	27.1
35	14.9	16.7	15.0	14.8
36	9.0	8.2	10.0	6.0
37	11.5	13.0	18.5	4.2
38	7.2	7.9	2.6	3.6
39	16.8	17.2	15.1	18.5
40	19.9	14.3	0.0	21.7
41	13.2	8.5	6.8	10.0
42	10.1	4.2	9.1	10.5
43	12.1	13.2	8.9	14.9
44	25.0	23.1	29.2	15.6
45	13.1	8.4	9.5	13.2
46	18.6	17.1	16.7	13.9
47	19.5	17.2	14.3	17.4
48	12.7	18.3	17.1	5.7
49	16.5	6.8	23.7	10.0
50	12.9	9.0	15.0	16.7
51	11.8	7.4	10.3	10.7
52	22.1	20.0	18.0	22.2
53	16.0	21.8	8.5	11.6
54	14.8	18.9	6.5	12.5
55	17.0	10.4	14.3	27.7
56	14.4	8.8	17.5	9.0
57	9.3	2.9	10.0	16.7
58	33.0	41.2	16.7	20.0
59	25.2	28.6	23.5	23.8
60	15.9	19.2	11.4	14.3
61	13.4	5.7	11.1	8.6
62	13.4	9.5	14.7	9.1
63	26.0	20.0	25.0	21.1
64	23.7	22.8	19.2	22.4
65	21.3	15.9	22.6	19.8
66	16.0	17.9	16.1	18.1
67	6.5	8.6	4.4	9.1
68	22.7	23.9	18.5	26.1
69	15.0	10.0	17.9	22.2
70	18.0	17.2	14.8	12.3
71	21.4	20.7	26.7	24.0
72	16.0	15.5	21.4	20.8
73	10.2	13.6	6.2	12.7
74	14.3	10.0	24.2	10.8
75	14.5	14.6	13.9	24.4
76	14.1	21.4	19.4	11.1
77	12.0	20.4	14.5	12.5
78	5.7	3.2	4.8	6.1
79	11.2	9.9	14.0	13.1
80	19.5	15.4	26.7	14.5
81	17.1	15.7	15.2	11.9
82	13.1	8.3	20.6	7.7
83	12.7	17.9	12.8	9.1
84	12.2	10.0	14.3	11.0
85	12.6	14.3	18.8	21.1
88	11.5	16.1	11.1	12.9
89	24.6	29.7	28.8	27.9
90	14.2	9.7	10.0	6.7
91	14.8	15.0	7.4	6.9

The sample default rates in this table are random: a different draw would give different values. The following observations are therefore based on this one table only.

For some franchise codes, the sample default rates deviate substantially from the true default rates at the population level.
For most franchise codes, the sample default rates are reasonably close to the population default rates. Therefore, we do not test the significance of the differences between the population and sample default rates.

The patterns in the franchise code-specific default rates described above are shown in the graphs.

The overall performance of the three random sampling plans, based on the sample drawn under each, is summarized in the first graph. We also look at the mean squared error (MSE) of the differences in default rates between each of the three random samples and the population, summarized in the second graph. The simple random sampling (SRS) plan appears to perform better than the other two plans. On this evidence, SRS is the best of the three plans for the analysis of the SBA data.
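To make the MSE metric concrete, the Python sketch below computes it for the first five rows of the comparison table only (codes 10, 13, 15, 16, 17). It assumes MSE means the average squared difference between a sample rate and the population rate; the five-row values illustrate the calculation and are not the full-table results shown in the graph.

```python
# (population, SRS, systematic, stratified) default rates, in percent,
# for the first five franchise codes in the comparison table.
rows = {
    "10": (15.8, 15.4, 15.5, 20.8),
    "13": (22.3, 29.7, 8.9, 18.8),
    "15": (16.9, 15.0, 13.6, 20.5),
    "16": (10.0, 11.5, 8.1, 1.5),
    "17": (21.1, 18.4, 20.1, 25.3),
}


def mse(column):
    """Mean squared difference between a sample column and the population."""
    squared = [(r[column] - r[0]) ** 2 for r in rows.values()]
    return sum(squared) / len(squared)


result = {"srs": mse(1), "sys": mse(2), "str": mse(3)}
for plan, value in result.items():
    print(plan, round(value, 2))
# srs 13.61
# sys 39.03
# str 28.02
```

On these five rows SRS has the smallest MSE, in line with the conclusion drawn from the full table, although five rows alone would not justify that conclusion.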

6 Discussion of Results and Conclusions

Three types of random sampling plans were used in a comparative performance analysis for the SBA bank loan dataset. The Franchise Code variable was used to define the stratification variable for the stratified sampling plan as well as the study population. The difference between sample-level and population-level franchise code-specific default rates was used to compare the performance of the three sampling plans. The simple random sampling (SRS) plan appeared to perform the best and should be the type of random sample used in this analysis.

The results of the comparative performance analysis, however, are based on a single sample per plan, so they could vary considerably from one draw to the next. Taking multiple samples per sampling plan would probably have been a better method for this analysis. Also, many of the observations were not used in the analysis because they did not have a franchise code. The performance comparison of the three random sampling plans might have turned out differently if more of the observations had franchise codes.
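The multiple-sample idea could look like the Python sketch below, shown for SRS only. Everything in it is hypothetical: the synthetic population of 1000 loans with a 15% default rate, the sample size of 100, the 200 replicates, and the function name `repeated_srs_mse`. It was not run on the SBA data.

```python
import random


def repeated_srs_mse(flags, n, reps, rng):
    """Average squared error of the SRS default rate over `reps` draws."""
    true_rate = sum(flags) / len(flags)
    squared_errors = []
    for _ in range(reps):
        draw = rng.sample(flags, n)  # one SRS without replacement
        squared_errors.append((sum(draw) / n - true_rate) ** 2)
    return sum(squared_errors) / reps


# Synthetic population: 150 defaults (1) and 850 non-defaults (0).
flags = [1] * 150 + [0] * 850
rng = random.Random(7)  # arbitrary seed

mse = repeated_srs_mse(flags, n=100, reps=200, rng=rng)
print(mse)
```

Averaging over many draws removes the dependence on one lucky or unlucky sample. Running the same loop for the systematic and stratified plans would give a comparison that does not rest on a single draw.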

7 Appendix

Comparison of franchise code-specific default rates between population, SRS, and Systematic Sample.
	default.rate.pop	default.rate.srs	default.rate.sys
10	15.8	15.4	15.5
13	22.3	29.7	8.9
15	16.9	15.0	13.6
16	10.0	11.5	8.1
17	21.1	18.4	20.1
18	15.0	17.9	20.0
19	15.4	13.6	13.0
20	23.6	17.3	25.6
21	9.3	8.6	6.4
22	7.6	5.9	6.5
23	15.3	11.1	23.7
24	14.7	19.5	14.1
25	10.1	10.3	16.9
26	13.1	9.8	10.9
28	32.0	26.3	27.8
29	16.7	12.9	9.4
30	17.7	26.7	18.8
31	24.3	21.4	15.4
32	22.6	29.4	7.7
33	17.7	25.0	7.4
34	21.4	18.1	25.5
35	14.9	16.7	15.0
36	9.0	8.2	10.0
37	11.5	13.0	18.5
38	7.2	7.9	2.6
39	16.8	17.2	15.1
40	19.9	14.3	0.0
41	13.2	8.5	6.8
42	10.1	4.2	9.1
43	12.1	13.2	8.9
44	25.0	23.1	29.2
45	13.1	8.4	9.5
46	18.6	17.1	16.7
47	19.5	17.2	14.3
48	12.7	18.3	17.1
49	16.5	6.8	23.7
50	12.9	9.0	15.0
51	11.8	7.4	10.3
52	22.1	20.0	18.0
53	16.0	21.8	8.5
54	14.8	18.9	6.5
55	17.0	10.4	14.3
56	14.4	8.8	17.5
57	9.3	2.9	10.0
58	33.0	41.2	16.7
59	25.2	28.6	23.5
60	15.9	19.2	11.4
61	13.4	5.7	11.1
62	13.4	9.5	14.7
63	26.0	20.0	25.0
64	23.7	22.8	19.2
65	21.3	15.9	22.6
66	16.0	17.9	16.1
67	6.5	8.6	4.4
68	22.7	23.9	18.5
69	15.0	10.0	17.9
70	18.0	17.2	14.8
71	21.4	20.7	26.7
72	16.0	15.5	21.4
73	10.2	13.6	6.2
74	14.3	10.0	24.2
75	14.5	14.6	13.9
76	14.1	21.4	19.4
77	12.0	20.4	14.5
78	5.7	3.2	4.8
79	11.2	9.9	14.0
80	19.5	15.4	26.7
81	17.1	15.7	15.2
82	13.1	8.3	20.6
83	12.7	17.9	12.8
84	12.2	10.0	14.3
85	12.6	14.3	18.8
88	11.5	16.1	11.1
89	24.6	29.7	28.8
90	14.2	9.7	10.0
91	14.8	15.0	7.4

Before creating the table comparing default rates for all the samples, a table excluding the stratified random sample was made to compare the other two samples first. The simple random sampling plan performed much better than the systematic random sampling plan.

SBA Bank Loan Random Sampling and Performance Analysis

Andrew Heneghan

10/23/2022