1 Introduction

We compare three sampling plans: simple random sampling, systematic sampling, and stratified sampling. The bank loan data is treated as a population. It comes in 9 subsets, which we combine into one data set called bankloan.

2 Stratification variable

A stratification variable is a categorical variable used to divide a population into subpopulations (strata), one for each of its values. To sample from the whole population, we take a subsample from each subpopulation and combine the subsamples. In this example, I show how zip codes can be turned into a stratification variable.

Population.size
899164

I first extracted the first two digits of each zip code and built a frequency distribution of the two-digit prefixes, in order to decide how to combine the smaller categories.

Var1 Freq
0 283
1 24
10 14759
11 19211
12 6749
13 6397
14 14310
15 10221
16 6293
17 6181
18 7530
19 15093
2 11
20 10208
21 13148
22 5263
23 6829
24 2888
25 1965
26 2631
27 8414
28 13556
29 8861
3 5
30 21577
31 5178
32 14010
33 23899
34 6252
35 5748
36 3143
37 7174
38 7549
39 6829
4 5
40 6244
41 2007
42 2392
43 10843
44 14441
45 8705
46 11399
47 3300
48 13097
49 8183
5 5
50 5993
51 2395
52 4656
53 11261
54 11936
55 17969
56 7381
57 5128
58 5808
59 8778
6 4
60 23790
61 5167
62 4090
63 8926
64 8814
65 8312
66 6508
67 6481
68 7153
69 1139
7 6
70 12487
71 4003
72 5524
73 5510
74 6082
75 17706
76 12580
77 20879
78 15928
79 9346
8 15
80 20308
81 3792
82 3370
83 10088
84 18872
85 16923
86 2035
87 4524
88 5419
89 8308
9 24
90 25034
91 20052
92 32356
93 12858
94 17673
95 21038
96 5010
97 11083
98 20000
99 5832
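
The prefix extraction and frequency count described above can be sketched in Python as follows. This is only an illustration: the zip codes in `zips` are made up, and `zip_prefix` is a hypothetical helper name, not part of the original analysis.

```python
from collections import Counter

def zip_prefix(zip_code):
    """Return the first two characters of a zip code written as text."""
    return str(zip_code)[:2]

# Hypothetical zip codes, for illustration only (not taken from the loan data).
zips = [19446, 19380, 10001, 90210, 0, 7]

freq = Counter(zip_prefix(z) for z in zips)

# Sorting the prefixes as strings gives the same ordering as the table above
# ("0", "1", "10", "11", ...), because the prefixes are text, not numbers.
for prefix, count in sorted(freq.items()):
    print(prefix, count)
```

Zip codes stored as numbers lose any leading zero, which is one way single-character prefixes such as "0" or "7" can arise.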

Afterwards, I grouped the first two digits of the zip codes into the 10s, 20s, 30s, 40s, 50s, 60s, 70s, 80s, and 90s. The first two digits of a zip code represent a region of the U.S. For example, if 19 is extracted from 19446, then 19 represents a zip code in a Northeastern state such as Pennsylvania. We then define a new, smaller study population by deleting the observations with prefixes 0 to 9, because these records have no usable zip code or their groups are relatively small.
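
A minimal sketch of this grouping step is shown below. The labelling rule (first digit followed by "09", so that prefixes 10 to 19 become "109") is an assumption inferred from the group labels used in the tables, and `str_zip` is a hypothetical function name.

```python
def str_zip(prefix):
    """Map a two-digit prefix such as "19" to a stratum label such as "109".

    Returns None for anything that is not a two-digit prefix from 10 to 99;
    those records are the ones dropped from the study population.
    """
    if len(prefix) != 2 or not prefix.isdigit() or prefix[0] == "0":
        return None
    return prefix[0] + "09"   # assumed labelling rule: "1" + "09" -> "109"

# Hypothetical prefixes, for illustration only.
prefixes = ["19", "10", "9", "90", "0", "33"]
labels = [str_zip(p) for p in prefixes]
study_labels = [lab for lab in labels if lab is not None]
print(study_labels)
```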

2.1 Loan Default Rates By Region

We now find the loan default rates by region, as defined by the stratification variable strZip. Loan default status is given by the variable MIS_Status.

no.lab default no.default default.rate
0 0 222 61 78.4
1 0 1 23 4.2
109 758 18366 87620 17.3
2 0 4 7 36.4
209 273 12726 60764 17.3
3 0 1 4 20.0
309 139 22512 78708 22.2
4 0 0 5 0.0
409 236 14465 65910 18.0
5 0 0 5 0.0
509 87 8542 72676 10.5
6 0 1 3 25.0
609 218 13756 66406 17.2
7 0 1 5 16.7
709 83 20702 89260 18.8
8 0 6 9 40.0
809 49 16347 77243 17.5
9 0 4 20 16.7
909 154 29902 140880 17.5
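
The default.rate column can be reproduced from the default and no.default counts. The sketch below assumes, based on the figures in the table, that records without a status label (no.lab) are excluded from the denominator; `default_rate` is a hypothetical helper name.

```python
def default_rate(default, no_default):
    """Percentage of labeled loans that defaulted, rounded to one decimal."""
    labeled = default + no_default
    if labeled == 0:
        return 0.0
    return round(100 * default / labeled, 1)

# Counts copied from the table above.
print(default_rate(18366, 87620))   # group 109
print(default_rate(222, 61))        # group 0
print(default_rate(0, 5))           # group 4
```

For group 109 this gives 18366 / (18366 + 87620), or about 17.3 percent. Dividing by all 106744 records in the group would give about 17.2 percent instead, which does not match the table.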

2.2 Study Population

109 209 309 409 509 609 709 809 909
106744 73763 101359 80611 81305 80380 110045 93639 170936

3 Sampling Plans

Now, we will implement three sampling plans. Each sample contains 1000 observations.

Simple Random Sampling

We take a sample of 1000 from the study population of 898782 records (the 899164 original records minus the 382 in the 0 to 9 groups). In this procedure, 1000 loan records are chosen at random, each with the same probability of being selected. The resulting sample has 1000 observations and 30 variables (the Var.count column below).

Size Var.count
1000 30
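
A simple random sample of record positions can be sketched as follows. This is an illustrative Python version, not the original code; the population size used here is the sum of the nine group sizes listed in Section 2.2, and `simple_random_sample` is a hypothetical function name.

```python
import random

def simple_random_sample(population_size, n, seed=None):
    """Draw n distinct record positions (0-based), each equally likely."""
    rng = random.Random(seed)
    return rng.sample(range(population_size), n)

# 898782 = sum of the nine group sizes in the Study Population table.
STUDY_POPULATION_SIZE = 898782

srs_ids = simple_random_sample(STUDY_POPULATION_SIZE, 1000, seed=1)
print(len(srs_ids))
```

Sampling positions rather than rows keeps the sketch independent of how the data set is stored; the positions would then be used to select the corresponding loan records.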

3.1 Systematic sampling

In systematic sampling we take every k-th observation: after a random start, the sample consists of every k-th record in the column for the variable Zip. The jump size k is chosen so that the sample contains about 1000 records. For the study population of 898782 records, k = 898782/1000, rounded down to 898. We use sample() to randomly pick one of the first 898 records and then select every 898th record after it to form the systematic sample.

Size Var.count
1000 30
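
The systematic selection can be sketched as below. This is a sketch under stated assumptions rather than the original code: it uses the study population size from Section 2.2 (898782, the sum of the nine group sizes), takes the jump size as the population size divided by the sample size and rounded down, and truncates the result to exactly n records. `systematic_sample` is a hypothetical function name.

```python
import random

def systematic_sample(population_size, n, seed=None):
    """Return n record positions (0-based) spaced one jump size apart."""
    jump = population_size // n              # jump size k
    rng = random.Random(seed)
    start = rng.randrange(jump)              # random start among the first k records
    ids = list(range(start, population_size, jump))
    return ids[:n]                           # assumption: drop any extra record at the end

sys_ids = systematic_sample(898782, 1000, seed=1)
print(898782 // 1000, len(sys_ids))
```

Because the population size is not an exact multiple of the jump size, some starting points yield 1001 positions before truncation; keeping the first n is one simple way to hold the sample size fixed.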

3.2 Stratified Sampling

I then tabulated the stratified zip codes and drew a sample of about 1000 by taking a random subsample from each zip-code group, with the number taken from each group proportional to the group's size. The subsample sizes are listed below.

109 209 309 409 509 609 709 809 909
119 82 113 90 90 89 122 104 190

The final code takes 119 records from the 109 group, 82 from the 209 group, 113 from the 309 group, 90 from the 409 group, 90 from the 509 group, 89 from the 609 group, 122 from the 709 group, 104 from the 809 group, and 190 from the 909 group to create a stratified sample of about 1000.

Size Var.count
1000 30
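
The per-group sample sizes follow from proportional allocation. The sketch below assumes each allocation is the group's share of the study population times 1000, rounded to the nearest integer; the function names are hypothetical and the group sizes are copied from Section 2.2.

```python
import random

# Group sizes copied from the Study Population table (Section 2.2).
STRATUM_SIZES = {
    "109": 106744, "209": 73763, "309": 101359,
    "409": 80611, "509": 81305, "609": 80380,
    "709": 110045, "809": 93639, "909": 170936,
}

def proportional_allocation(sizes, n):
    """Sample size per stratum, proportional to stratum size (rounded)."""
    total = sum(sizes.values())
    return {label: round(n * size / total) for label, size in sizes.items()}

def stratified_ids(sizes, n, seed=None):
    """Draw a simple random sample of positions inside each stratum."""
    rng = random.Random(seed)
    alloc = proportional_allocation(sizes, n)
    return {label: rng.sample(range(sizes[label]), alloc[label]) for label in sizes}

allocation = proportional_allocation(STRATUM_SIZES, 1000)
print(allocation)
print(sum(allocation.values()))
```

Because each allocation is rounded separately, the total need not equal the target exactly. Under this rounding rule the nine allocations match the table above and add up to 999.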

4 Performance Analysis of Random Samples

In this section, we perform a comparative analysis of the three random samples. One metric we can use is the default rate in each region defined by the first two digits of the zip codes.

4.1 Population-level Default Rates

We calculated the default rate for each region earlier. We use these population-level rates as the benchmark for the sample-level, region-specific rates. In the table, MIS_Status lets us see how many borrowers in each zip-code group defaulted on their loans (the default column) and how many did not (the no.default column). The last column gives the percentage of borrowers in each group who defaulted.

Population size, default counts, and population default rates
no.lab default no.default default.rate
0 0 222 61 78.4
1 0 1 23 4.2
109 758 18366 87620 17.3
2 0 4 7 36.4
209 273 12726 60764 17.3
3 0 1 4 20.0
309 139 22512 78708 22.2
4 0 0 5 0.0
409 236 14465 65910 18.0
5 0 0 5 0.0
509 87 8542 72676 10.5
6 0 1 3 25.0
609 218 13756 66406 17.2
7 0 1 5 16.7
709 83 20702 89260 18.8
8 0 6 9 40.0
809 49 16347 77243 17.5
9 0 4 20 16.7
909 154 29902 140880 17.5

4.2 Region-Specific Default Rates based on SRS

To compare, we construct the following table of region-specific default rates. Some of the region-specific default rates differ noticeably between the SRS and the population. More visual comparisons are given in the next section.

Comparison of region-specific default rates between the population and the SRS.
default.rate.pop default.rate.srs
Upper North East 17.3 19.6
Lower North East 17.3 16.3
Lower South East 22.2 23.2
Upper Mid East 18.0 17.6
Upper Middle 10.5 11.8
Center of U.S 17.2 22.0
South of U.S 18.8 13.7
Mid West 17.5 16.7
West Coast 17.5 19.2

4.3 Region-Specific Default Rates based on the Systematic Sample

The following table shows the rates for the population, the SRS, and the systematic sample.

Comparison of region-specific default rates between the population, the SRS, and the systematic sample.
default.rate.pop default.rate.srs default.rate.sys
109 17.3 19.6 17.3
209 17.3 16.3 14.8
309 22.2 23.2 22.8
409 18.0 17.6 26.8
509 10.5 11.8 10.9
609 17.2 22.0 10.6
709 18.8 13.7 17.7
809 17.5 16.7 18.2
909 17.5 19.2 20.5

4.4 Region-Specific Default Rates based on the Stratified Sample

Here, we put all the information in the following table. It seems that the stratified sample performs better than the SRS.

Comparison of region-specific default rates between the population, the SRS, the systematic sample, and the stratified sample.
default.rate.pop default.rate.srs default.rate.sys default.rate.str
Upper North East 17.3 19.6 17.3 20.3
Lower North East 17.3 16.3 14.8 19.5
Lower South East 22.2 23.2 22.8 20.4
Upper Mid East 18.0 17.6 26.8 17.6
Upper Middle 10.5 11.8 10.9 12.2
Center of U.S 17.2 22.0 10.6 15.7
South of U.S 18.8 13.7 17.7 18.9
Mid West 17.5 16.7 18.2 19.2
West Coast 17.5 19.2 20.5 15.8

5 Visual Comparison

First, note that the default rates in the table above are computed from random samples, so they are themselves random. The following observations rest solely on this one set of samples. In the previous section, we calculated the region-specific default rates for the population and for the SRS, systematic, and stratified samples. We now create statistical graphs to compare the default rates among the samples. The stratified sample may be the best of the three because its default rates are closest to the population default rates. However, we have not tested the significance of the differences between the population and sample default rates.

5.1 Mean squared error

The patterns of region-specific default rates described above are shown in the following line plot.

To assess the overall performance of the three sampling plans based on these single-step samples, we look at the mean squared error (MSE) of the differences in default rates between the population and each of the three random samples. The result is summarized in the bottom panel of the above figure. Computed from the rounded rates in the Section 4.4 table, the MSE is about 3.1 for the stratified sample, 6.9 for the SRS, and 15.4 for the systematic sample, so in this single draw the stratified plan outperforms the SRS and systematic plans.
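
The mean squared errors can be recomputed directly from the rounded rates in the Section 4.4 table. The sketch below assumes an unweighted average of the squared differences over the nine groups; the variable and function names are chosen for illustration.

```python
# Rates copied from the Section 4.4 table, in the order 109, 209, ..., 909.
POP = [17.3, 17.3, 22.2, 18.0, 10.5, 17.2, 18.8, 17.5, 17.5]
SAMPLES = {
    "srs": [19.6, 16.3, 23.2, 17.6, 11.8, 22.0, 13.7, 16.7, 19.2],
    "sys": [17.3, 14.8, 22.8, 26.8, 10.9, 10.6, 17.7, 18.2, 20.5],
    "str": [20.3, 19.5, 20.4, 17.6, 12.2, 15.7, 18.9, 19.2, 15.8],
}

def mse(estimates, truth):
    """Unweighted mean of squared differences between two equal-length lists."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)

mse_by_plan = {name: mse(rates, POP) for name, rates in SAMPLES.items()}
for name, value in mse_by_plan.items():
    print(f"{name}: {value:.2f}")
# srs: 6.86
# sys: 15.39
# str: 3.13
```

These values come from a single sample per plan and from rates rounded to one decimal, so they describe this one draw only.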

6 Conclusion

We have implemented three sampling plans that are commonly used in practice, based on a large bank loan data set. The zip code was used to define the study population and the stratification variable for stratified sampling. The difference between population-level and sample-level region-specific default rates was used to compare the performance of the three sampling plans. The comparison was based on a single sample from each plan, so the results could vary considerably from one draw to the next. A more reliable way to assess the overall performance of the three sampling plans is to take multiple samples and compare the mean squared errors.
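
The repeated-sampling comparison suggested above could be sketched as follows. Everything here is illustrative: the population is a small synthetic one generated inside the code (it is not the bank loan data), the function names are hypothetical, and the MSE is an unweighted average over groups.

```python
import random

def group_rates(records):
    """Default rate (percent) per group from (group, defaulted) pairs."""
    totals, defaults = {}, {}
    for group, defaulted in records:
        totals[group] = totals.get(group, 0) + 1
        defaults[group] = defaults.get(group, 0) + defaulted
    return {g: 100 * defaults[g] / totals[g] for g in totals}

def mse(sample_rates, pop_rates):
    """Mean squared difference over the groups present in the sample."""
    common = [g for g in pop_rates if g in sample_rates]
    return sum((sample_rates[g] - pop_rates[g]) ** 2 for g in common) / len(common)

def srs(records, n, rng):
    return rng.sample(records, n)

def systematic(records, n, rng):
    jump = len(records) // n
    start = rng.randrange(jump)
    return records[start::jump][:n]

def stratified(records, n, rng):
    by_group = {}
    for rec in records:
        by_group.setdefault(rec[0], []).append(rec)
    out = []
    for members in by_group.values():
        k = round(n * len(members) / len(records))   # proportional allocation
        out.extend(rng.sample(members, k))
    return out

def average_mse(records, sampler, n, reps, seed=0):
    """Average the MSE over reps independent samples from one plan."""
    rng = random.Random(seed)
    pop_rates = group_rates(records)
    errors = [mse(group_rates(sampler(records, n, rng)), pop_rates) for _ in range(reps)]
    return sum(errors) / reps

# Synthetic toy population: 3 groups of 2000 records, about 18% defaults.
toy_rng = random.Random(42)
toy = [(g, int(toy_rng.random() < 0.18)) for g in ("109", "209", "309") for _ in range(2000)]
toy_rng.shuffle(toy)

plans = {"srs": srs, "sys": systematic, "str": stratified}
results = {name: average_mse(toy, plan, 300, 50) for name, plan in plans.items()}
print(results)
```

With the real data, `toy` would be replaced by the (strZip, default indicator) pairs of the study population, and the number of repetitions would be chosen large enough for the averages to stabilize.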