Introduction

The U.S. Small Business Administration (SBA) collected historical data between 1987 and 2014 of loans guaranteed to some degree by the SBA. The variable FranchiseCode was used as a stratification variable to determine which of three types of random samples, simple, systematic, and stratified, would perform the best in an analysis of this data. Those observations that had franchise codes were divided into categories based off the first 2 digits of the code. Some categories were deleted due to small sizes or an absence of a franchise code, finalizing a study population. One of each kind of random sample, simple, systematic, and stratified, was created with 5000 observations. The default rates of each sample were used in a comparative performance analysis to determine which was best to use for the analysis. The systematic random sample performed better than the simple and stratified random samples, leading me to believe the systematic random sample is the best type to use for the analysis of this data.

Data

I used a U.S. Small Business Administration (SBA) dataset for my analysis, which uses data between the years of 1987 and 2014. I used three different kinds of samples for my analysis. I used a simple random sample, a systematic random sample, and a stratified random sample, each using a total of 5000 observations of loans guaranteed to some degree by the SBA. Of the 27 variables in the dataset, I decided to use the FranchiseCode variable as my stratification variable for the stratified random sample. Most of the observations did not have a franchise code, which limited the population I could use, which I believe helped the study. Even though most observations didn’t have a franchise code, the number that did made it easier for me to find an appropriate sample size. I wanted to determine which of the three samples was best for this analysis.

loan01 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational09.csv", header = TRUE)[, -1]
bankLoan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)

Methods

After choosing FranchiseCode as my stratification variable, I will divide the codes into categories by their first two digits to make the data more manageable. I will then find the loan default rates by franchise code and delete categories that are unclassified or not large enough to be used via the inclusion rule. This will define the study population. After finding the population, I will run the three different types of samples listed above, each composed of 5000 observations. Using their default rates, I will run a comparative performance analysis on the three samples to determine which one is the best to use for this analysis.

Analysis

franchise =as.character(bankLoan$FranchiseCode) # make a character vector
franchise.2.digits = substr(bankLoan$FranchiseCode, 1, 2) # extract the first two digits of the NAICS code
bankLoan$FranchiseCode2Digit = franchise.2.digits  # add the above two-digit variable the loan data
FranchiseCode2Digit0 = bankLoan$FranchiseCode2Digit # extract the 2-digit NAICS
FranchiseCode2Digit =  bankLoan$FranchiseCode2Digit  # extract the 2-digit NAICS-copy
bankLoan$strFranchiseCode = FranchiseCode2Digit
del.categories = c("0", "1", "11", "12", "14", "27", "3", "86", "87", "92", "94", "95", "96", "97", "98", "99") # categories to be deleted in the original population
del.obs.status = !(bankLoan$strFranchiseCode %in% del.categories) # deletion status. ! negation operator
study.pop = bankLoan[del.obs.status,]  # excluding the categories
knitr::kable(t(table(study.pop$strFranchiseCode)))
10 13 15 16 17 18 19 20 21 22 23 24 25 26 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 88 89 90 91
2548 328 746 667 1571 320 434 779 1827 776 314 827 908 567 231 538 300 288 177 305 978 1098 512 244 849 546 231 408 388 480 328 694 370 236 537 407 1707 283 1374 700 411 476 682 246 203 218 359 710 336 389 679 927 736 1010 2688 273 577 257 489 735 377 795 276 493 3658 1247 632 858 268 338 745 191 312 626 607 297

I made FranchiseCode a stratification variable and divided the observation codes into categories by their first two digits. After checking the distributions of the 2-digit codes, I found the loan default rates by franchise code and deleted any categories that were unclassified or not large enough to be used via the inclusion rule. Doing this, I was able to define a study population to draw three different kinds of random samples from. Each sample was composed of 5000 loan observations.

Simple Random Sample

study.pop$sampling.frame = 1:length(study.pop$GrAppv) # sampling list
# checking the sampling list variable
sampled.list = sample(1:length(study.pop$GrAppv), 5000) # sampling the list
SRS.sample = study.pop[sampled.list,] # extract the sampling units (observations)
## dimension check
dimension.SRS = dim(SRS.sample)
names(dimension.SRS) = c("Size", "Var.count")
knitr::kable(t(dimension.SRS)) # checking the sample size
Size Var.count
5000 30

For the simple random sample, I randomly chose a sample of 5000 loans guaranteed to some degree by the SBA. I defined a sampling list and then added it to the study population.

Systematic Random Sample

jump.size = dim(study.pop)[1]%/%5000 # find the jump size in the systematic sampling
rand.starting.pt=sample(1:jump.size,1) # find the random starting value
sampling.id = seq(rand.starting.pt, dim(study.pop)[1], jump.size)  # sampling IDs
sys.sample=study.pop[sampling.id,] # extract the sampling units of systematic samples
sys.Sample.dim = dim(sys.sample)
names(sys.Sample.dim) = c("Size", "Var.count")
knitr::kable(t(sys.Sample.dim))
Size Var.count
5095 30

For the systematic random sample, I used a jump size to randomly chose a sample of 5000 loans guaranteed to some degree by the SBA at a regular interval.

Because the jump size involves rounding error and the population is large, the actual systematic sample size is slightly different from the target size. In this report, I used the integer part of the actual jump size. The actual systematic sampling size is slightly bigger than the target size. We can take away some records random from the systematic sample to make the size to be equal to the target size.

Stratified Random Sample

freq.table = table(study.pop$strFranchiseCode)  # frequency table of strFranchiseCode
rel.freq = freq.table/sum(freq.table)   # relative frequency 
strata.size = round(rel.freq*5000) # strata size allocation
strata.names=names(strata.size) # extract strFranchiseCode names for accuracy checking
strata.sample = study.pop[1,] # create a reference data frame
strata.sample$add.id = 1  # add a temporary ID to because in the loop
# i =2     # testing a single iteration
for (i in 1:length(strata.names)){
   ith.strata.names = strata.names[i] # extract data frame names
   ith.strata.size = strata.size[i] # allocated stratum size
   # The following code identifies observations to be selected
   ith.sampling.id = which(study.pop$strFranchiseCode==ith.strata.names) 
   ith.strata = study.pop[ith.sampling.id,]  # i-th stratified population
   ith.strata$add.id = 1:dim(ith.strata)[1]  # add sampling list/frame
   # The following code generates a subset of random ID
   ith.sampling.id = sample(1:dim(ith.strata)[1], ith.strata.size) 
   ## Create a selection status
   ith.sample =ith.strata[ith.strata$add.id %in%ith.sampling.id,]
   ## dim(ith.sample)         $ check the sample
   strata.sample = rbind(strata.sample, ith.sample)  # stack all data frame!
 }
 # dim(strata.sample)
 strat.sample.final = strata.sample[-1,] # drop the temporary stratum ID 
knitr::kable(t(strata.size))  # makes a table of stratified data
10 13 15 16 17 18 19 20 21 22 23 24 25 26 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 88 89 90 91
250 32 73 65 154 31 43 76 179 76 31 81 89 56 23 53 29 28 17 30 96 108 50 24 83 54 23 40 38 47 32 68 36 23 53 40 168 28 135 69 40 47 67 24 20 21 35 70 33 38 67 91 72 99 264 27 57 25 48 72 37 78 27 48 359 122 62 84 26 33 73 19 31 61 60 29

For the stratified sample, I used FranchiseCode as the stratification variable. Even though most observations didn’t have a franchise code, the number that did made it easier for me to find an appropriate sample size. A SRS was taken from each FranchiseCode stratum. I separated the different codes by their first two numbers and deleted the categories that were either too small or represented no franchise. Not many observations were able to be used since they didn’t have a franchise code. This made the population smaller and easier to sample from.

Results

Using the default rates of the three samples, I ran the comparative performance analysis on the three samples to determine which one is the best to use for this analysis.

x.table = table(bankLoan$strFranchiseCode, bankLoan$MIS_Status)
no.lab = x.table[,1]   # first column consists of unknown default label
default = x.table[,2]
no.default = x.table[,3]
default.rate = round(100*default/(default+no.default),1)
 default.status.rate = cbind(no.lab = no.lab, 
                          default = default, 
                          no.default = no.default,
                          default.rate=default.rate)
x.table = table(SRS.sample$strFranchiseCode, SRS.sample$MIS_Status)
no.lab.srs = x.table[,1] # first column consists of unknown default label
default.srs = x.table[,2]
no.default.srs = x.table[,3]
default.rate.srs = round(100*default.srs/(default.srs+no.default.srs),1)
##
franchise.code = names(default.rate.srs)    # extract NSICS code
default.rate.pop = default.rate[franchise.code]
# cbind(industry.code,industry.name)
SRS.pop.rates = cbind(default.rate.pop,default.rate.srs)
x.table = table(sys.sample$strFranchiseCode, sys.sample$MIS_Status)
no.lab.sys = x.table[,1]      # first column consists of unknown default label
default.sys = x.table[,2]
no.default.sys = x.table[,3]
default.rate.sys = round(100*default.sys/(default.sys+no.default.sys),1)
sys.SRS.pop.rates = cbind(default.rate.pop, default.rate.srs, default.rate.sys)
x.table = table(strat.sample.final$strFranchiseCode, strat.sample.final$MIS_Status)
no.lab.str = x.table[,1]      # first column consists of unknown default label
default.str = x.table[,2]
no.default.str = x.table[,3]
default.rate.str = round(100*default.str/(default.str+no.default.str),1)
str.SRS.pop.rates = cbind(default.rate.pop, default.rate.srs, default.rate.sys, default.rate.str)
knitr::kable(str.SRS.pop.rates, caption="Comparison of franchise-specific default rates between population, SRS, Systematic Sample, and Stratified Samples.")
Comparison of franchise-specific default rates between population, SRS, Systematic Sample, and Stratified Samples.
default.rate.pop default.rate.srs default.rate.sys default.rate.str
10 15.8 14.9 12.8 16.4
13 22.3 33.3 22.6 31.2
15 16.9 10.4 18.3 16.4
16 10.0 17.4 5.6 10.8
17 21.1 18.6 22.2 24.7
18 15.0 14.3 17.2 22.6
19 15.4 11.9 12.0 20.9
20 23.6 21.9 19.0 22.4
21 9.3 8.9 10.8 13.4
22 7.6 7.4 3.7 7.9
23 15.3 13.0 11.5 29.0
24 14.7 22.6 13.6 21.2
25 10.1 12.5 7.2 3.4
26 13.1 13.8 14.9 8.9
28 32.0 35.0 38.1 21.7
29 16.7 22.2 17.5 22.6
30 17.7 27.9 22.6 24.1
31 24.3 27.3 22.2 28.6
32 22.6 15.0 33.3 23.5
33 17.7 25.0 23.3 26.7
34 21.4 19.6 11.8 26.0
35 14.9 16.1 18.0 15.7
36 9.0 11.1 7.0 10.0
37 11.5 17.9 13.6 16.7
38 7.2 6.1 5.3 9.6
39 16.8 15.0 18.0 13.0
40 19.9 35.0 15.0 13.0
41 13.2 10.0 9.5 15.0
42 10.1 15.4 5.3 7.9
43 12.1 2.3 9.8 6.4
44 25.0 8.6 21.6 31.2
45 13.1 13.2 20.7 8.8
46 18.6 21.6 19.2 22.2
47 19.5 16.7 12.5 17.4
48 12.7 12.2 12.0 7.5
49 16.5 19.5 22.5 17.5
50 12.9 11.8 13.4 10.7
51 11.8 19.2 4.5 11.5
52 22.1 19.4 20.0 13.3
53 16.0 24.6 19.2 17.4
54 14.8 25.0 18.4 17.5
55 17.0 15.7 22.6 12.8
56 14.4 16.9 10.1 17.9
57 9.3 4.5 12.5 8.3
58 33.0 17.6 29.4 35.0
59 25.2 36.8 20.8 19.0
60 15.9 29.0 21.2 14.3
61 13.4 14.7 9.2 15.7
62 13.4 11.9 13.6 18.2
63 26.0 25.0 20.8 21.1
64 23.7 20.0 27.7 22.4
65 21.3 24.7 23.9 17.6
66 16.0 13.9 13.6 16.7
67 6.5 4.6 3.0 5.1
68 22.7 26.4 21.4 23.5
69 15.0 11.5 13.3 29.6
70 18.0 15.8 19.0 21.1
71 21.4 13.0 31.0 24.0
72 16.0 16.3 17.0 16.7
73 10.2 12.3 10.0 13.9
74 14.3 13.2 7.1 18.9
75 14.5 16.9 12.5 12.8
76 14.1 21.4 18.9 14.8
77 12.0 11.1 16.3 14.6
78 5.7 5.4 5.5 4.5
79 11.2 10.0 7.8 12.4
80 19.5 19.3 13.1 22.6
81 17.1 16.5 23.4 17.9
82 13.1 13.0 7.7 19.2
83 12.7 8.3 14.7 15.2
84 12.2 5.5 12.0 11.0
85 12.6 5.6 14.3 10.5
88 11.5 6.2 11.1 6.5
89 24.6 28.6 30.4 26.2
90 14.2 12.2 14.5 18.3
91 14.8 7.4 13.8 17.2
n=length(default.rate.pop)
plot(NULL, xlim=c(0,n), ylim=c(0, 50), 
     xlab="Franchise Code", 
     ylab ="Default Rates (Percentage)", axes=FALSE) # empty plot
# Light gray background
rect(par("usr")[1], par("usr")[3],
     par("usr")[2], par("usr")[4],
     col = "white")

# Add white grid
grid(nx = NULL, ny = NULL,
     col = "white", lwd = 1)
title("Comparison of Franchise-specific Default Rates Based on Random Samples")
points(1:n, as.vector(default.rate.pop), pch=16, col="gold3", cex = 0.8)
lines(1:n, as.vector(default.rate.pop),  lty=1, col="gold3", cex = 0.8)
#
points(1:n, as.vector(default.rate.srs), pch=17, col="darkorange1", cex = 0.8)
lines(1:n, as.vector(default.rate.srs), lty=2, col="darkorange1", cex = 0.8)
#
points(1:n, as.vector(default.rate.sys), pch=19, col="navyblue", cex = 0.8)
lines(1:n, as.vector(default.rate.sys), lty=3, col="navyblue", cex = 0.8)
#
points(1:n, as.vector(default.rate.str), pch=20, col="magenta", cex = 0.8)
lines(1:n, as.vector(default.rate.str), lty=4, col="magenta", cex = 0.8)
#
axis(1,at=1:n, label=franchise.code)
axis(2, las = 2)
#
clr = c("yellow","darkorange1","navyblue","magenta")
rowMax=apply(str.SRS.pop.rates, 1, max) # max default rate in each franchise
#segments(1:n, rep(0,n), 1:n, rowMax, lty=2, col="lightgray", lwd = 0.5)
legend(58, 50, c("Population", "Simple Random Sampling", "Systematic Sampling", "Stratified Sampling"), lty=rep(1,4), col=clr, pch=c(16,17,19,20), cex=0.6, bty="n")

After running a performance analysis on three different random samples, I found the systematic random sample to have performed better compared to the simple and stratified random samples. Therefore, I believe the systematic random sample is the best type of sample to use for the analysis of the SBA data.

Conclusion

Using the SBA dataset of loans and FranchiseCode as a stratification variable, I was able to determine which of the three samples, simple, systematic, and stratified, would have the best performance in a data analysis. A sample of 5000 observations was taken for each of the three samples. Examining the default rates of the three samples showed that the systematic random sample was best to use for this data analysis. Yet, many of the observations were not actually used in the analysis since they didn’t have a franchise code. I wonder, however, if the performance analysis of the three random samples would have turned out different if more of the loan observations in the dataset had actual franchise codes to use.

Appendix

x.table = table(bankLoan$strFranchiseCode, bankLoan$MIS_Status)
no.lab = x.table[,1]  # first column consists of unknown default label
default = x.table[,2]
no.default = x.table[,3]
default.rate = round(100*default/(default+no.default),1)
default.status.rate = cbind(no.lab = no.lab, 
                          default = default, 
                          no.default = no.default,
                          default.rate=default.rate)
knitr::kable(default.status.rate, caption = "Population size, default counts, and population default rates")
Population size, default counts, and population default rates
no.lab default no.default default.rate
0 795 71175 136865 34.2
1 1159 78523 558872 12.3
10 0 402 2146 15.8
11 0 5 50 9.1
12 0 3 12 20.0
13 0 73 255 22.3
14 0 11 80 12.1
15 1 126 619 16.9
16 0 67 600 10.0
17 0 332 1239 21.1
18 0 48 272 15.0
19 0 67 367 15.4
20 0 184 595 23.6
21 3 169 1655 9.3
22 0 59 717 7.6
23 0 48 266 15.3
24 3 121 703 14.7
25 0 92 816 10.1
26 0 74 493 13.1
27 0 21 70 23.1
28 0 74 157 32.0
29 0 90 448 16.7
3 0 3 9 25.0
30 1 53 246 17.7
31 0 70 218 24.3
32 0 40 137 22.6
33 0 54 251 17.7
34 0 209 769 21.4
35 0 164 934 14.9
36 2 46 464 9.0
37 0 28 216 11.5
38 0 61 788 7.2
39 0 92 454 16.8
40 0 46 185 19.9
41 0 54 354 13.2
42 3 39 346 10.1
43 0 58 422 12.1
44 0 82 246 25.0
45 0 91 603 13.1
46 0 69 301 18.6
47 0 46 190 19.5
48 1 68 468 12.7
49 0 67 340 16.5
50 0 221 1486 12.9
51 20 31 232 11.8
52 0 303 1071 22.1
53 0 112 588 16.0
54 0 61 350 14.8
55 0 81 395 17.0
56 1 98 583 14.4
57 0 23 223 9.3
58 0 67 136 33.0
59 0 55 163 25.2
60 0 57 302 15.9
61 0 95 615 13.4
62 0 45 291 13.4
63 0 101 288 26.0
64 1 161 517 23.7
65 1 197 729 21.3
66 0 118 618 16.0
67 0 66 944 6.5
68 1 609 2078 22.7
69 0 41 232 15.0
70 0 104 473 18.0
71 0 55 202 21.4
72 0 78 411 16.0
73 1 75 659 10.2
74 0 54 323 14.3
75 1 115 679 14.5
76 0 39 237 14.1
77 0 59 434 12.0
78 0 209 3449 5.7
79 2 140 1105 11.2
80 0 123 509 19.5
81 0 147 711 17.1
82 0 35 233 13.1
83 0 43 295 12.7
84 0 91 654 12.2
85 0 24 167 12.6
86 0 15 67 18.3
87 0 8 45 15.1
88 0 36 276 11.5
89 0 154 472 24.6
90 1 86 520 14.2
91 0 44 253 14.8
92 0 13 72 15.3
94 0 35 33 51.5
95 0 6 57 9.5
96 0 1 17 5.6
97 0 10 38 20.8
98 0 13 138 8.6
99 0 0 1 0.0

I made this table to calculate default rates across the franchises. I decided to use the population-level franchise-specific rates as a reference and compare them with the sample-level default rates. This table also includes the category of loans with no franchise code.

x.table = table(sys.sample$strFranchiseCode, sys.sample$MIS_Status)
no.lab.sys = x.table[,1]      # first column consists of unknown default label
default.sys = x.table[,2]
no.default.sys = x.table[,3]
default.rate.sys = round(100*default.sys/(default.sys+no.default.sys),1)
sys.SRS.pop.rates = cbind(default.rate.pop, default.rate.srs, default.rate.sys)
knitr::kable(sys.SRS.pop.rates, caption="Comparison of industry-specific default rates between population, SRS, and Systematic Sample.")
Comparison of industry-specific default rates between population, SRS, and Systematic Sample.
default.rate.pop default.rate.srs default.rate.sys
10 15.8 14.9 12.8
13 22.3 33.3 22.6
15 16.9 10.4 18.3
16 10.0 17.4 5.6
17 21.1 18.6 22.2
18 15.0 14.3 17.2
19 15.4 11.9 12.0
20 23.6 21.9 19.0
21 9.3 8.9 10.8
22 7.6 7.4 3.7
23 15.3 13.0 11.5
24 14.7 22.6 13.6
25 10.1 12.5 7.2
26 13.1 13.8 14.9
28 32.0 35.0 38.1
29 16.7 22.2 17.5
30 17.7 27.9 22.6
31 24.3 27.3 22.2
32 22.6 15.0 33.3
33 17.7 25.0 23.3
34 21.4 19.6 11.8
35 14.9 16.1 18.0
36 9.0 11.1 7.0
37 11.5 17.9 13.6
38 7.2 6.1 5.3
39 16.8 15.0 18.0
40 19.9 35.0 15.0
41 13.2 10.0 9.5
42 10.1 15.4 5.3
43 12.1 2.3 9.8
44 25.0 8.6 21.6
45 13.1 13.2 20.7
46 18.6 21.6 19.2
47 19.5 16.7 12.5
48 12.7 12.2 12.0
49 16.5 19.5 22.5
50 12.9 11.8 13.4
51 11.8 19.2 4.5
52 22.1 19.4 20.0
53 16.0 24.6 19.2
54 14.8 25.0 18.4
55 17.0 15.7 22.6
56 14.4 16.9 10.1
57 9.3 4.5 12.5
58 33.0 17.6 29.4
59 25.2 36.8 20.8
60 15.9 29.0 21.2
61 13.4 14.7 9.2
62 13.4 11.9 13.6
63 26.0 25.0 20.8
64 23.7 20.0 27.7
65 21.3 24.7 23.9
66 16.0 13.9 13.6
67 6.5 4.6 3.0
68 22.7 26.4 21.4
69 15.0 11.5 13.3
70 18.0 15.8 19.0
71 21.4 13.0 31.0
72 16.0 16.3 17.0
73 10.2 12.3 10.0
74 14.3 13.2 7.1
75 14.5 16.9 12.5
76 14.1 21.4 18.9
77 12.0 11.1 16.3
78 5.7 5.4 5.5
79 11.2 10.0 7.8
80 19.5 19.3 13.1
81 17.1 16.5 23.4
82 13.1 13.0 7.7
83 12.7 8.3 14.7
84 12.2 5.5 12.0
85 12.6 5.6 14.3
88 11.5 6.2 11.1
89 24.6 28.6 30.4
90 14.2 12.2 14.5
91 14.8 7.4 13.8

Before I created the table comparing default rates for all the samples, I made a table excluding the stratified random sample to compare the other two samples first. It turned out that the systematic sample performed much better compared to the simple random sample.