1 Introduction

We compare three sampling plans: simple random sampling, systematic sampling, and stratified sampling. The bank loan data is treated as a population that comes in 9 subsets. We combine these 9 subsets into one data set called bankloan.

2 Stratification variable

A stratification variable is a categorical variable that divides a population into subpopulations (strata), one for each value of the variable. To sample from the whole population, we take a subsample from each subpopulation and combine the subsamples into one sample. In this example, I will show how Zip codes can be turned into a stratification variable.

Population.size
899164

I first extracted the first two digits of the Zip codes to create a frequency distribution of two-digit Zip prefixes and to decide which of the smaller categories could be combined.

Var1	Freq
0	283
1	24
10	14759
11	19211
12	6749
13	6397
14	14310
15	10221
16	6293
17	6181
18	7530
19	15093
2	11
20	10208
21	13148
22	5263
23	6829
24	2888
25	1965
26	2631
27	8414
28	13556
29	8861
3	5
30	21577
31	5178
32	14010
33	23899
34	6252
35	5748
36	3143
37	7174
38	7549
39	6829
4	5
40	6244
41	2007
42	2392
43	10843
44	14441
45	8705
46	11399
47	3300
48	13097
49	8183
5	5
50	5993
51	2395
52	4656
53	11261
54	11936
55	17969
56	7381
57	5128
58	5808
59	8778
6	4
60	23790
61	5167
62	4090
63	8926
64	8814
65	8312
66	6508
67	6481
68	7153
69	1139
7	6
70	12487
71	4003
72	5524
73	5510
74	6082
75	17706
76	12580
77	20879
78	15928
79	9346
8	15
80	20308
81	3792
82	3370
83	10088
84	18872
85	16923
86	2035
87	4524
88	5419
89	8308
9	24
90	25034
91	20052
92	32356
93	12858
94	17673
95	21038
96	5010
97	11083
98	20000
99	5832
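
The prefix extraction and frequency count described above can be sketched as follows. This is a minimal illustration in Python rather than the code used for the report; the helper name `zip_prefix` and the example Zip values are hypothetical. It assumes the Zip codes are stored as numbers, which would explain why values with fewer than two digits show up as the single-digit categories 0 to 9 in the table.

```python
from collections import Counter

def zip_prefix(zip_code):
    """Return the first two characters of a Zip code given as an int or str."""
    return str(zip_code)[:2]

# Hypothetical Zip values for illustration only (not taken from bankloan).
example_zips = [19446, 19380, 10001, 90210, 92101, 7]

# Frequency distribution of the two-digit prefixes.
prefix_counts = Counter(zip_prefix(z) for z in example_zips)

# Sorting the prefixes as strings gives the same ordering as the table above,
# where "2" comes after "19" and before "20".
for prefix, freq in sorted(prefix_counts.items()):
    print(prefix, freq)
```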

Afterwards, I grouped the two-digit prefixes into the 10s, 20s, 30s, 40s, 50s, 60s, 70s, 80s, and 90s. The first two digits of a Zip code represent a region in the U.S. For example, if 19 is extracted from 19446, then 19 represents a Zip code from a Northeastern state such as Pennsylvania. We then define a new, smaller study population by deleting the observations with prefixes 0 to 9, because these records either have no usable Zip code or belong to groups that are relatively small (382 records in total).
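
A minimal sketch of this grouping step, in Python, is shown below. The function name `stratum_label` is hypothetical; the labels 109, 209, ..., 909 are the ones that appear in the tables of this report, and the check at the end uses the frequencies of prefixes 10 to 19 from the table above, which add up to the size reported for group 109.

```python
def stratum_label(prefix):
    """Map a Zip prefix to its stratum label, or None if it is dropped.

    Prefixes 10-19 map to "109", 20-29 to "209", and so on up to "909".
    Prefixes 0 to 9 are excluded from the study population.
    """
    value = int(prefix)
    if value < 10:
        return None
    return f"{value // 10}09"

# Frequencies of prefixes 10 to 19, copied from the frequency table.
counts_10s = {
    "10": 14759, "11": 19211, "12": 6749, "13": 6397, "14": 14310,
    "15": 10221, "16": 6293, "17": 6181, "18": 7530, "19": 15093,
}

size_109 = sum(n for p, n in counts_10s.items() if stratum_label(p) == "109")
print(size_109)  # 106744
```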

2.1 Loan Default Rates By Region

We now find the loan default rates by Zip-code region, as defined by the stratification variable strZip. Loan default status is defined by the variable MIS_Status.

	no.lab	default	no.default	default.rate
0	0	222	61	78.4
1	0	1	23	4.2
109	758	18366	87620	17.3
2	0	4	7	36.4
209	273	12726	60764	17.3
3	0	1	4	20.0
309	139	22512	78708	22.2
4	0	0	5	0.0
409	236	14465	65910	18.0
5	0	0	5	0.0
509	87	8542	72676	10.5
6	0	1	3	25.0
609	218	13756	66406	17.2
7	0	1	5	16.7
709	83	20702	89260	18.8
8	0	6	9	40.0
809	49	16347	77243	17.5
9	0	4	20	16.7
909	154	29902	140880	17.5
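
The default.rate column can be reproduced from the default and no.default counts. The sketch below is in Python and is not the report's own code; the function name `default_rate` is hypothetical. It assumes that records without a status label (no.lab) are left out of the denominator, which is what the figures in the table imply.

```python
def default_rate(default, no_default):
    """Percentage of labelled loans that defaulted, rounded to one decimal."""
    total = default + no_default
    if total == 0:
        return float("nan")  # no labelled loans in this group
    return round(100 * default / total, 1)

# Counts copied from the table above.
print(default_rate(18366, 87620))  # group 109 -> 17.3
print(default_rate(22512, 78708))  # group 309 -> 22.2
print(default_rate(8542, 72676))   # group 509 -> 10.5
```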

2.2 Study Population

109	209	309	409	509	609	709	809	909
106744	73763	101359	80611	81305	80380	110045	93639	170936

3 Sampling Plans

Now, we will implement three sampling plans. Each sampling plan produces a sample of 1000 observations.

Simple Random Sampling

We take a sample of 1000 from the whole population of 899164. In this procedure, we randomly select 1000 records, each with the same probability of being chosen. The resulting sample has 1000 observations (Size) and 30 variables (Var.count).

Size	Var.count
1000	30
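
As a rough illustration of simple random sampling without replacement, here is a short Python sketch. It is not the code used for the report; the function name `simple_random_sample` and the seed are hypothetical, and record positions stand in for the actual bankloan rows.

```python
import random

def simple_random_sample(population_size, n, seed=None):
    """Draw n distinct record positions, each equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(range(population_size), n)

# Population size as stated in the text; the seed is arbitrary.
srs = simple_random_sample(899164, 1000, seed=123)
print(len(srs))
```

The selected positions would then be used to pick the corresponding rows of the data set.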

3.1 Systematic sampling

We perform systematic sampling by taking every kth observation in the column for the variable Zip, so the sample consists of every kth observation among the records numbered from n = 1 to n = 899164. The jump size k is the population size divided by the sample size, rounded down, so that stepping through the population yields a sample of 1000. The calculation reported here is 694216/1000 with a jump size of 69; note that 694216/1000 is about 694, and that for the 899164 records cited above the jump size would be 899. We use sample() to randomly select a starting record from the first k records and then include every kth record after it in the systematic sample.

Size	Var.count
1000	30
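
The systematic selection can be sketched as below, in Python. This is an illustration under assumptions, not the report's code: the function name `systematic_sample_positions` is hypothetical, and the population size of 899164 is taken from Section 2 rather than from the jump-size calculation quoted in the text.

```python
import random

def systematic_sample_positions(population_size, sample_size, seed=None):
    """Pick a random start within the first jump, then step by the jump size."""
    jump = population_size // sample_size   # jump size, rounded down
    rng = random.Random(seed)
    start = rng.randrange(jump)             # random start among the first `jump` records
    return [start + i * jump for i in range(sample_size)]

positions = systematic_sample_positions(899164, 1000, seed=123)
print(899164 // 1000)   # jump size: 899
print(len(positions))   # 1000
```

Because the jump size is rounded down, the last selected position always falls inside the population, and a few records at the end are never eligible.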

3.2 Stratified Sampling

Then, I tabulated the stratified Zip codes and computed how many records to draw from each stratum, in proportion to the stratum sizes. The stratified sample of about 1000 is created by drawing a random subsample from each group in the population; the subsample sizes are listed below.

109	209	309	409	509	609	709	809	909
119	82	113	90	90	89	122	104	190

The final code takes 119 records from the 109 group, 82 from the 209 group, 113 from the 309 group, 91 from the 409 group, 90 from the 509 group, 89 from the 609 group, 122 from the 709 group, 104 from the 809 group, and 190 from the 909 group to create a stratified sample of 1000. (The table above lists 90 for the 409 group; the allocations sum to 1000 only with 91.)

Size	Var.count
1000	30
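
Proportional allocation and the within-stratum draws can be sketched as follows. The Python code is illustrative only; the function names `proportional_allocation` and `stratified_sample` are hypothetical, simple rounding is an assumption about how the allocation was computed, and record positions stand in for the actual rows. The stratum sizes are those of the study population in Section 2.2.

```python
import random

# Stratum sizes copied from the study population table (Section 2.2).
strata_sizes = {
    "109": 106744, "209": 73763, "309": 101359,
    "409": 80611, "509": 81305, "609": 80380,
    "709": 110045, "809": 93639, "909": 170936,
}

def proportional_allocation(sizes, n):
    """Allocate n draws to the strata in proportion to their sizes (simple rounding)."""
    total = sum(sizes.values())
    return {label: round(n * size / total) for label, size in sizes.items()}

def stratified_sample(sizes, allocation, seed=None):
    """Draw record positions independently within each stratum."""
    rng = random.Random(seed)
    sample = []
    for label, n_h in allocation.items():
        positions = rng.sample(range(sizes[label]), n_h)
        sample.extend((label, pos) for pos in positions)
    return sample

allocation = proportional_allocation(strata_sizes, 1000)
print(allocation)
print(sum(allocation.values()))  # 999

sample = stratified_sample(strata_sizes, allocation, seed=123)
```

With simple rounding the allocation matches the table above and adds up to 999, so one stratum has to receive an extra draw to reach exactly 1000.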

4 Performance Analysis of Random Samples

In this section, we perform a comparative analysis of the three random samples. One metric we can use is the default rate in each region defined by the first two digits of the Zip codes.

4.1 Population-level Default Rates

We calculated the default rates by Zip-code region earlier. We use these population-level rates as the benchmark for the sample-level region-specific rates. In the table, the MIS_Status variable lets us see how many borrowers in each group of Zip codes defaulted on their loans (third column, default) and how many did not (fourth column, no.default). The last column gives the percentage of borrowers in each group of Zip codes who defaulted on their loans.

Population size, default counts, and population default rates
	no.lab	default	no.default	default.rate
0	0	222	61	78.4
1	0	1	23	4.2
109	758	18366	87620	17.3
2	0	4	7	36.4
209	273	12726	60764	17.3
3	0	1	4	20.0
309	139	22512	78708	22.2
4	0	0	5	0.0
409	236	14465	65910	18.0
5	0	0	5	0.0
509	87	8542	72676	10.5
6	0	1	3	25.0
609	218	13756	66406	17.2
7	0	1	5	16.7
709	83	20702	89260	18.8
8	0	6	9	40.0
809	49	16347	77243	17.5
9	0	4	20	16.7
909	154	29902	140880	17.5

4.2 Region-Specific Default Rates based on SRS

To compare, we construct the following table of region-specific default rates. Some of the region-specific default rates differ noticeably between the SRS and the population; for example, the South of U.S rate is 13.7 in the SRS versus 18.8 in the population. More visual comparisons will be given in the next section.

Comparison of region-specific default rates between the population and the SRS.
	default.rate.pop	default.rate.srs
Upper North East	17.3	19.6
Lower North East	17.3	16.3
Lower South East	22.2	23.2
Upper Mid East	18.0	17.6
Upper Middle	10.5	11.8
Center of U.S	17.2	22.0
South of U.S	18.8	13.7
Mid West	17.5	16.7
West Coast	17.5	19.2

4.3 Region-specific Default Rates: Systematic Sample

The following table shows the default rates for the population, the SRS, and the systematic sample.

Comparison of region-specific default rates between the population, the SRS, and the systematic sample.
	default.rate.pop	default.rate.srs	default.rate.sys
109	17.3	19.6	17.3
209	17.3	16.3	14.8
309	22.2	23.2	22.8
409	18.0	17.6	26.8
509	10.5	11.8	10.9
609	17.2	22.0	10.6
709	18.8	13.7	17.7
809	17.5	16.7	18.2
909	17.5	19.2	20.5

4.4 Region-specific Default Rates: Stratified Sample

Here, we put all of the information in the following table. It seems that the stratified sample performs better than the SRS.

Comparison of region-specific default rates between the population, the SRS, the systematic sample, and the stratified sample.
	default.rate.pop	default.rate.srs	default.rate.sys	default.rate.str
Upper North East	17.3	19.6	17.3	20.3
Lower North East	17.3	16.3	14.8	19.5
Lower South East	22.2	23.2	22.8	20.4
Upper Mid East	18.0	17.6	26.8	17.6
Upper Middle	10.5	11.8	10.9	12.2
Center of U.S	17.2	22.0	10.6	15.7
South of U.S	18.8	13.7	17.7	18.9
Mid West	17.5	16.7	18.2	19.2
West Coast	17.5	19.2	20.5	15.8

5 Visual Comparison

First of all, we note that the default rates in the above table are based on random samples and are therefore themselves random. The following observations are based solely on this one table. In the previous section, we calculated the region-specific default rates for the population and for the SRS, systematic, and stratified samples. We now create statistical graphs to compare the default rates among the samples. The stratified sample may perform best because its default rates are closest overall to the population default rates. However, we have not tested the significance of the differences between the population and sample default rates.

5.1 Mean squared error

The patterns of region-specific default rates described above are shown in the following line plot.

To assess the overall performance of the three sampling plans based on these single-step samples, we look at the mean squared error of the differences in default rates between the population and each of the three random samples. The result is summarized in the bottom panel of the figure. Computed from the rates in the table of Section 4.4, the stratified sample has the smallest mean squared error (about 3.1), followed by the SRS (about 6.9) and the systematic sample (about 15.4), so the stratified plan performs best on these particular samples.
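
The mean squared error for each plan can be recomputed from the rates in the table of Section 4.4. The Python sketch below is for illustration and is not the report's own code; the function name `mean_squared_error` is hypothetical, and the inputs are the rounded rates printed in the table, so the results are approximate.

```python
# Default rates (percent) copied from the table in Section 4.4,
# in the order 109, 209, ..., 909.
rates = {
    "pop": [17.3, 17.3, 22.2, 18.0, 10.5, 17.2, 18.8, 17.5, 17.5],
    "srs": [19.6, 16.3, 23.2, 17.6, 11.8, 22.0, 13.7, 16.7, 19.2],
    "sys": [17.3, 14.8, 22.8, 26.8, 10.9, 10.6, 17.7, 18.2, 20.5],
    "str": [20.3, 19.5, 20.4, 17.6, 12.2, 15.7, 18.9, 19.2, 15.8],
}

def mean_squared_error(sample_rates, population_rates):
    """Average squared difference between sample and population rates."""
    diffs = [s - p for s, p in zip(sample_rates, population_rates)]
    return sum(d * d for d in diffs) / len(diffs)

mse = {name: mean_squared_error(rates[name], rates["pop"])
       for name in ("srs", "sys", "str")}

for name, value in mse.items():
    print(name, round(value, 2))
```

On these rounded rates the values come out at roughly 6.86 for the SRS, 15.39 for the systematic sample, and 3.13 for the stratified sample.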

6 Conclusion

We have implemented three sampling plans that are commonly used in practice, using a large bank loan data set. The Zip code was used to define the study population and the stratification variable for stratified sampling. The difference between population-level and sample-level region-specific default rates was used to compare the performance of the three sampling plans. The comparison was based on a single sample from each plan, so the results could vary considerably from one draw to the next. A more reliable approach to obtaining a stable overall performance of the three sampling plans is to take multiple samples under each plan and compare the mean squared errors.
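
The repeated-sampling idea can be sketched as a small simulation. The Python code below is only an illustration under stated assumptions: it builds a synthetic population (not the real bankloan data) whose stratum sizes are the study population sizes divided by 100 and whose default rates are the population rates from Section 4.1, and all function names, the seed, and the number of repeats are hypothetical.

```python
import random

# (stratum size, population default rate in percent), from the report's tables.
STRATA = {
    "109": (106744, 17.3), "209": (73763, 17.3), "309": (101359, 22.2),
    "409": (80611, 18.0), "509": (81305, 10.5), "609": (80380, 17.2),
    "709": (110045, 18.8), "809": (93639, 17.5), "909": (170936, 17.5),
}

def build_population(strata, rng, scale=100):
    """Synthetic (stratum, defaulted) records, scaled down to keep the sketch fast."""
    population = []
    for label, (size, rate) in strata.items():
        n = size // scale
        n_default = round(n * rate / 100)
        population.extend((label, 1) for _ in range(n_default))
        population.extend((label, 0) for _ in range(n - n_default))
    rng.shuffle(population)
    return population

def rates_by_stratum(records):
    totals, defaults = {}, {}
    for label, flag in records:
        totals[label] = totals.get(label, 0) + 1
        defaults[label] = defaults.get(label, 0) + flag
    return {label: 100 * defaults[label] / totals[label] for label in totals}

def mse(sample_rates, pop_rates):
    # Strata that happen to be missing from a sample are skipped.
    common = [label for label in pop_rates if label in sample_rates]
    return sum((sample_rates[l] - pop_rates[l]) ** 2 for l in common) / len(common)

def srs(population, n, rng):
    return rng.sample(population, n)

def systematic(population, n, rng):
    jump = len(population) // n
    start = rng.randrange(jump)
    return [population[start + i * jump] for i in range(n)]

def stratified(population, n, rng):
    by_stratum = {}
    for record in population:
        by_stratum.setdefault(record[0], []).append(record)
    sample = []
    for records in by_stratum.values():
        n_h = round(n * len(records) / len(population))  # proportional allocation
        sample.extend(rng.sample(records, n_h))
    return sample

def average_mse(plan, population, n, repeats, rng):
    pop_rates = rates_by_stratum(population)
    total = sum(mse(rates_by_stratum(plan(population, n, rng)), pop_rates)
                for _ in range(repeats))
    return total / repeats

rng = random.Random(2024)
population = build_population(STRATA, rng)
results = {name: average_mse(plan, population, 1000, 50, rng)
           for name, plan in (("srs", srs), ("sys", systematic), ("str", stratified))}
for name, value in results.items():
    print(name, round(value, 2))
```

Averaging over many draws smooths out the luck of any single sample, which is the point of the suggestion above; the numbers printed here describe the synthetic population only and say nothing definitive about the real data.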

Week 7: Random Sampling and Performance Analysis

Yuanqi Zhang