We compare three sampling plans: simple random sampling, stratified sampling, and systematic sampling. The bank loan data is treated as a population that is stored in 9 subsets. We combine these 9 subsets into one data set called bankloan.
A stratification variable is a categorical variable that partitions a population into subpopulations (strata) according to its values, so that each observation belongs to exactly one stratum. To draw a stratified sample from the whole population, we take a subsample from each subpopulation and combine the subsamples. In this example, I will show how zip codes can be turned into a stratification variable.
| Population.size |
|---|
| 899164 |
I first extracted the first two digits of the zip codes in order to create a frequency distribution of two-digit zip code prefixes and to decide which of the smaller categories could be combined.
| Var1 | Freq |
|---|---|
| 0 | 283 |
| 1 | 24 |
| 10 | 14759 |
| 11 | 19211 |
| 12 | 6749 |
| 13 | 6397 |
| 14 | 14310 |
| 15 | 10221 |
| 16 | 6293 |
| 17 | 6181 |
| 18 | 7530 |
| 19 | 15093 |
| 2 | 11 |
| 20 | 10208 |
| 21 | 13148 |
| 22 | 5263 |
| 23 | 6829 |
| 24 | 2888 |
| 25 | 1965 |
| 26 | 2631 |
| 27 | 8414 |
| 28 | 13556 |
| 29 | 8861 |
| 3 | 5 |
| 30 | 21577 |
| 31 | 5178 |
| 32 | 14010 |
| 33 | 23899 |
| 34 | 6252 |
| 35 | 5748 |
| 36 | 3143 |
| 37 | 7174 |
| 38 | 7549 |
| 39 | 6829 |
| 4 | 5 |
| 40 | 6244 |
| 41 | 2007 |
| 42 | 2392 |
| 43 | 10843 |
| 44 | 14441 |
| 45 | 8705 |
| 46 | 11399 |
| 47 | 3300 |
| 48 | 13097 |
| 49 | 8183 |
| 5 | 5 |
| 50 | 5993 |
| 51 | 2395 |
| 52 | 4656 |
| 53 | 11261 |
| 54 | 11936 |
| 55 | 17969 |
| 56 | 7381 |
| 57 | 5128 |
| 58 | 5808 |
| 59 | 8778 |
| 6 | 4 |
| 60 | 23790 |
| 61 | 5167 |
| 62 | 4090 |
| 63 | 8926 |
| 64 | 8814 |
| 65 | 8312 |
| 66 | 6508 |
| 67 | 6481 |
| 68 | 7153 |
| 69 | 1139 |
| 7 | 6 |
| 70 | 12487 |
| 71 | 4003 |
| 72 | 5524 |
| 73 | 5510 |
| 74 | 6082 |
| 75 | 17706 |
| 76 | 12580 |
| 77 | 20879 |
| 78 | 15928 |
| 79 | 9346 |
| 8 | 15 |
| 80 | 20308 |
| 81 | 3792 |
| 82 | 3370 |
| 83 | 10088 |
| 84 | 18872 |
| 85 | 16923 |
| 86 | 2035 |
| 87 | 4524 |
| 88 | 5419 |
| 89 | 8308 |
| 9 | 24 |
| 90 | 25034 |
| 91 | 20052 |
| 92 | 32356 |
| 93 | 12858 |
| 94 | 17673 |
| 95 | 21038 |
| 96 | 5010 |
| 97 | 11083 |
| 98 | 20000 |
| 99 | 5832 |
Afterwards, I grouped the first two digits of the zip codes into the 10s, 20s, 30s, 40s, 50s, 60s, 70s, 80s, and 90s. The first two digits of a zip code represent a region of the U.S. For example, if 19 is extracted from 19446, then 19 represents a zip code in a Northeastern state such as Pennsylvania. We then define a new, slightly smaller population by deleting the observations whose prefix falls in the range 0 to 9, because these records either have no valid zip code or form groups that are too small to be useful strata.
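The prefix extraction and grouping described above can be sketched as follows. This is only an illustration in Python, not the code used for the report: the helper names `zip_prefix` and `stratum_label` and the example zip codes are hypothetical, and the labels 109, 209, ..., 909 are assumed to stand for the 10s, 20s, ..., 90s.

```python
from collections import Counter

def zip_prefix(zip_code):
    """Return the first two characters of a zip code, read as text."""
    return str(zip_code)[:2]

def stratum_label(prefix):
    """Map a two-digit prefix to its decade group, e.g. '19' -> '109'.

    One-character prefixes ('0'..'9') are returned unchanged; these are the
    small groups that are removed from the study population.
    """
    if len(prefix) < 2:
        return prefix
    return prefix[0] + "09"

# Hypothetical zip codes, for illustration only.
zips = [19446, 19104, 90210, 60614, 33101, 7, 0]

# Frequency distribution of the two-digit prefixes.
prefix_counts = Counter(zip_prefix(z) for z in zips)

# Stratum label for every record.
strata = [stratum_label(zip_prefix(z)) for z in zips]

# Study population: keep only records that fall in groups 109..909.
study = [(z, s) for z, s in zip(zips, strata) if len(s) == 3]

print(prefix_counts)
print(study)
```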
We now find the loan default rates by region, where the regions are defined by the stratification variable strZip. Loan default status is given by the variable MIS_Status.
| strZip | no.lab | default | no.default | default.rate |
|---|---|---|---|---|
| 0 | 0 | 222 | 61 | 78.4 |
| 1 | 0 | 1 | 23 | 4.2 |
| 109 | 758 | 18366 | 87620 | 17.3 |
| 2 | 0 | 4 | 7 | 36.4 |
| 209 | 273 | 12726 | 60764 | 17.3 |
| 3 | 0 | 1 | 4 | 20.0 |
| 309 | 139 | 22512 | 78708 | 22.2 |
| 4 | 0 | 0 | 5 | 0.0 |
| 409 | 236 | 14465 | 65910 | 18.0 |
| 5 | 0 | 0 | 5 | 0.0 |
| 509 | 87 | 8542 | 72676 | 10.5 |
| 6 | 0 | 1 | 3 | 25.0 |
| 609 | 218 | 13756 | 66406 | 17.2 |
| 7 | 0 | 1 | 5 | 16.7 |
| 709 | 83 | 20702 | 89260 | 18.8 |
| 8 | 0 | 6 | 9 | 40.0 |
| 809 | 49 | 16347 | 77243 | 17.5 |
| 9 | 0 | 4 | 20 | 16.7 |
| 909 | 154 | 29902 | 140880 | 17.5 |
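As a check on the last column, the default rate appears to be the number of defaults divided by the number of loans with a known status (default plus no.default), expressed as a percentage. The short Python sketch below recomputes the rates for the nine regional groups from the counts in the table; the function name `default_rate` is hypothetical.

```python
def default_rate(n_default, n_no_default):
    """Percentage of defaults among loans with a known status."""
    total = n_default + n_no_default
    return 100.0 * n_default / total if total else float("nan")

# (default, no.default) counts copied from the table above.
counts = {
    "109": (18366, 87620),
    "209": (12726, 60764),
    "309": (22512, 78708),
    "409": (14465, 65910),
    "509": (8542, 72676),
    "609": (13756, 66406),
    "709": (20702, 89260),
    "809": (16347, 77243),
    "909": (29902, 140880),
}

rates = {s: round(default_rate(d, nd), 1) for s, (d, nd) in counts.items()}
print(rates)
```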
| 109 | 209 | 309 | 409 | 509 | 609 | 709 | 809 | 909 |
|---|---|---|---|---|---|---|---|---|
| 106744 | 73763 | 101359 | 80611 | 81305 | 80380 | 110045 | 93639 | 170936 |
Now, we will implement three sampling plans. Each sampling plan contains 1000 observations.

## Simple Random Sampling

We take a sample of 1000 from the whole population of 899164. In this procedure, we draw 1000 records at random, each with the same probability of being chosen. The resulting sample has 1000 observations and 30 variables.
| Size | Var.count |
|---|---|
| 1000 | 30 |
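A minimal sketch of simple random sampling, written in Python rather than with the tools used for this report. It draws row positions without replacement, so every record has the same chance of selection; the function name and the seed are hypothetical.

```python
import random

def simple_random_sample(population_size, n, seed=None):
    """Return n distinct 0-based row positions chosen uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(population_size), n))

# Population size taken from the text; the seed is arbitrary.
srs_rows = simple_random_sample(899164, 1000, seed=1)
print(len(srs_rows), srs_rows[:5])
```

The returned positions would then be used to pick the corresponding rows of the data set.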
## Systematic Sampling

We perform systematic sampling by taking every k-th observation in the column for the variable Zip, where k is the jump size, over the records numbered from 1 to 899164. The jump size is chosen so that the systematic sample contains 1000 observations: 899164/1000 gives a jump size of about 899. We use sample() to randomly take a starting record from the first 899 records and then select every 899th record after it to include in the systematic sample.
| Size | Var.count |
|---|---|
| 1000 | 30 |
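The systematic plan can be sketched as below. This is an illustrative Python version, not the report's own code: the jump size is taken as the integer part of the population size divided by the sample size, a random start is chosen among the first `jump` records, and every `jump`-th record after it is kept. The function name is hypothetical.

```python
import random

def systematic_sample(population_size, n, seed=None):
    """Return n 0-based row positions spaced `jump` apart after a random start."""
    rng = random.Random(seed)
    jump = population_size // n      # jump size (integer division)
    start = rng.randrange(jump)      # random start within the first `jump` records
    return jump, [start + k * jump for k in range(n)]

jump, sys_rows = systematic_sample(899164, 1000, seed=1)
print(jump, len(sys_rows), sys_rows[:3])
```

Because the start is below the jump size, the last selected position never exceeds the end of the data.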
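## Stratified Sampling

Then, I ran a program to create a table of the stratified zip codes. The sample of 1000 is allocated across the strata in proportion to their sizes in the population, and a random subsample is drawn from each group. The number of observations taken from each group is listed below.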
| 109 | 209 | 309 | 409 | 509 | 609 | 709 | 809 | 909 |
|---|---|---|---|---|---|---|---|---|
| 119 | 82 | 113 | 90 | 90 | 89 | 122 | 104 | 190 |
The final code takes 119 zip codes from the 109 group, 82 from the 209 group, 113 from the 309 group, 91 from the 409 group, 90 from the 509 group, 89 from the 609 group, 122 from the 709 group, 104 from the 809 group, and 190 from the 909 group to create a stratified sample of 1000. The rounded allocations in the table above sum to 999; taking 91 rather than 90 from the 409 group brings the total to 1000.
| Size | Var.count |
|---|---|
| 1000 | 30 |
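The allocation can be reproduced with a short Python sketch, assuming proportional allocation with ordinary rounding (the function names are hypothetical). The stratum sizes are the ones reported earlier for the nine groups. Note that rounded allocations need not add up exactly to the target sample size.

```python
import random

# Stratum sizes from the table of group sizes.
strata_sizes = {
    "109": 106744, "209": 73763, "309": 101359,
    "409": 80611, "509": 81305, "609": 80380,
    "709": 110045, "809": 93639, "909": 170936,
}

def proportional_allocation(sizes, n):
    """Allocate n across strata in proportion to stratum size (rounded)."""
    total = sum(sizes.values())
    return {label: round(n * size / total) for label, size in sizes.items()}

def stratified_sample(sizes, n, seed=None):
    """Draw a simple random sample of row positions inside every stratum."""
    rng = random.Random(seed)
    alloc = proportional_allocation(sizes, n)
    return {label: sorted(rng.sample(range(sizes[label]), k))
            for label, k in alloc.items()}

alloc = proportional_allocation(strata_sizes, 1000)
sample = stratified_sample(strata_sizes, 1000, seed=1)
print(alloc, sum(alloc.values()))
```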
In this section, we perform a comparative analysis of the three random samples. One metric we can use is the default rate in each region defined by the first two digits of the zip codes.
We calculated the population default rates by region earlier. We now compare these population-level rates with the sample-level region-specific rates. In the table, MIS_Status lets us see how many borrowers in a particular group of zip codes defaulted on their loans and how many did not (the 3rd and 4th columns, default and no.default). The last column gives the percentage of loans within that group of zip codes that defaulted.
| strZip | no.lab | default | no.default | default.rate |
|---|---|---|---|---|
| 0 | 0 | 222 | 61 | 78.4 |
| 1 | 0 | 1 | 23 | 4.2 |
| 109 | 758 | 18366 | 87620 | 17.3 |
| 2 | 0 | 4 | 7 | 36.4 |
| 209 | 273 | 12726 | 60764 | 17.3 |
| 3 | 0 | 1 | 4 | 20.0 |
| 309 | 139 | 22512 | 78708 | 22.2 |
| 4 | 0 | 0 | 5 | 0.0 |
| 409 | 236 | 14465 | 65910 | 18.0 |
| 5 | 0 | 0 | 5 | 0.0 |
| 509 | 87 | 8542 | 72676 | 10.5 |
| 6 | 0 | 1 | 3 | 25.0 |
| 609 | 218 | 13756 | 66406 | 17.2 |
| 7 | 0 | 1 | 5 | 16.7 |
| 709 | 83 | 20702 | 89260 | 18.8 |
| 8 | 0 | 6 | 9 | 40.0 |
| 809 | 49 | 16347 | 77243 | 17.5 |
| 9 | 0 | 4 | 20 | 16.7 |
| 909 | 154 | 29902 | 140880 | 17.5 |
To compare, we construct the following table of region-specific default rates. Some of the region-specific default rates differ noticeably between the SRS sample and the population. More visual comparisons will be given in the next section.
| Region | default.rate.pop | default.rate.srs |
|---|---|---|
| Upper North East | 17.3 | 19.6 |
| Lower North East | 17.3 | 16.3 |
| Lower South East | 22.2 | 23.2 |
| Upper Mid East | 18.0 | 17.6 |
| Upper Middle | 10.5 | 11.8 |
| Center of U.S | 17.2 | 22.0 |
| South of U.S | 18.8 | 13.7 |
| Mid West | 17.5 | 16.7 |
| West Coast | 17.5 | 19.2 |
The next table shows the default rates for the population, the SRS sample, and the systematic sample.
| strZip | default.rate.pop | default.rate.srs | default.rate.sys |
|---|---|---|---|
| 109 | 17.3 | 19.6 | 17.3 |
| 209 | 17.3 | 16.3 | 14.8 |
| 309 | 22.2 | 23.2 | 22.8 |
| 409 | 18.0 | 17.6 | 26.8 |
| 509 | 10.5 | 11.8 | 10.9 |
| 609 | 17.2 | 22.0 | 10.6 |
| 709 | 18.8 | 13.7 | 17.7 |
| 809 | 17.5 | 16.7 | 18.2 |
| 909 | 17.5 | 19.2 | 20.5 |
Here, we put all of the information into the following table. It seems that the stratified sample performs better than the SRS sample.
| Region | default.rate.pop | default.rate.srs | default.rate.sys | default.rate.str |
|---|---|---|---|---|
| Upper North East | 17.3 | 19.6 | 17.3 | 20.3 |
| Lower North East | 17.3 | 16.3 | 14.8 | 19.5 |
| Lower South East | 22.2 | 23.2 | 22.8 | 20.4 |
| Upper Mid East | 18.0 | 17.6 | 26.8 | 17.6 |
| Upper Middle | 10.5 | 11.8 | 10.9 | 12.2 |
| Center of U.S | 17.2 | 22.0 | 10.6 | 15.7 |
| South of U.S | 18.8 | 13.7 | 17.7 | 18.9 |
| Mid West | 17.5 | 16.7 | 18.2 | 19.2 |
| West Coast | 17.5 | 19.2 | 20.5 | 15.8 |
First of all, we note that the default rates in the above table are computed from random samples and are therefore themselves random. The following observations are based solely on this one set of samples. In the previous section, we calculated the region-specific default rates for the population and for the SRS, systematic, and stratified samples. We now create statistical graphs in order to compare the default rates among the samples. The stratified sample appears to represent the population best, because its default rates are overall the closest to the population default rates. However, we have not tested the significance of the differences in default rates between the population and the samples.
The line plot below shows these patterns of region-specific default rates.
To assess the overall performance of the three sampling plans based on these single-step samples, we look at the mean squared error of the differences in default rates between the population and each of the three random samples. The result is summarized in the bottom panel of the above figure. Computed from the rates tabulated above, the mean squared errors are about 6.9 for SRS, 15.4 for systematic sampling, and 3.1 for stratified sampling, so for this particular draw the stratified plan outperforms the SRS and systematic plans.
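The mean squared error comparison can be reproduced directly from the tabulated rates. The following Python sketch copies the four rate columns from the combined table and averages the squared differences between each sample and the population; the variable and function names are hypothetical.

```python
# Default rates (%) for groups 109..909, copied from the combined table.
pop_rates = [17.3, 17.3, 22.2, 18.0, 10.5, 17.2, 18.8, 17.5, 17.5]
srs_rates = [19.6, 16.3, 23.2, 17.6, 11.8, 22.0, 13.7, 16.7, 19.2]
sys_rates = [17.3, 14.8, 22.8, 26.8, 10.9, 10.6, 17.7, 18.2, 20.5]
str_rates = [20.3, 19.5, 20.4, 17.6, 12.2, 15.7, 18.9, 19.2, 15.8]

def mse(sample_rates, population_rates):
    """Mean of the squared differences between sample and population rates."""
    diffs = [s - p for s, p in zip(sample_rates, population_rates)]
    return sum(d * d for d in diffs) / len(diffs)

results = {
    "SRS": mse(srs_rates, pop_rates),
    "systematic": mse(sys_rates, pop_rates),
    "stratified": mse(str_rates, pop_rates),
}
for plan, value in results.items():
    print(f"{plan}: {value:.2f}")
```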
We have implemented three sampling plans that are commonly used in practice, based on a large bank loan data set. The zip code was used to define the study population and the stratification variable for stratified sampling. The difference between population-level and sample-level region-specific default rates was used to compare the performance of the three sampling plans. The comparison results were based on a single sample per plan, so they may vary considerably from one draw to the next. A more reliable approach to obtaining a stable measure of the overall performance of the three sampling plans is to take multiple samples and compare the mean squared errors.
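The repeated-sampling idea can be sketched as follows. This is a rough Python simulation under stated assumptions, not a result from the report: the population is represented only by the default and no.default counts of the nine groups, records without a status label are ignored, and the systematic plan is left out because it depends on the record order in the actual data. All names, the number of repetitions, and the seed are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate

# (default, no.default) counts per group 109..909 from the population table.
counts = [(18366, 87620), (12726, 60764), (22512, 78708), (14465, 65910),
          (8542, 72676), (13756, 66406), (20702, 89260), (16347, 77243),
          (29902, 140880)]
sizes = [d + nd for d, nd in counts]
bounds = list(accumulate(sizes))   # cumulative group boundaries
N = bounds[-1]
pop_rates = [100 * d / s for (d, _), s in zip(counts, sizes)]

def mse(rates):
    """Mean squared difference between sample and population rates."""
    return sum((r - p) ** 2 for r, p in zip(rates, pop_rates)) / len(pop_rates)

def srs_rates(rng, n):
    """Group default rates from one simple random sample of size n."""
    hits = [0] * len(sizes)
    seen = [0] * len(sizes)
    for i in rng.sample(range(N), n):
        h = bisect_right(bounds, i)                 # group containing record i
        offset = i - (bounds[h - 1] if h else 0)    # position inside the group
        seen[h] += 1
        if offset < counts[h][0]:                   # first d records are defaults
            hits[h] += 1
    # An empty group is scored 0.0; this is very unlikely for n = 1000.
    return [100 * x / m if m else 0.0 for x, m in zip(hits, seen)]

def stratified_rates(rng, n):
    """Group default rates from one proportionally allocated stratified sample."""
    rates = []
    for (d, _), size in zip(counts, sizes):
        k = max(1, round(n * size / N))
        draws = rng.sample(range(size), k)
        rates.append(100 * sum(i < d for i in draws) / k)
    return rates

def average_mse(draw, reps=200, n=1000, seed=0):
    """Average the MSE over `reps` independent samples."""
    rng = random.Random(seed)
    return sum(mse(draw(rng, n)) for _ in range(reps)) / reps

avg_srs = average_mse(srs_rates)
avg_str = average_mse(stratified_rates)
print("SRS:", avg_srs)
print("stratified:", avg_str)
```

Averaging over many draws smooths out the variation of any single sample, which is the point of the recommendation above.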