Introduction

The U.S. Small Business Administration (SBA) collected historical data between 1987 and 2014 of loans guaranteed to some degree by the SBA. The variable FranchiseCode was used as a stratification variable to determine which of three types of random samples, simple, systematic, and stratified, would perform the best in an analysis of this data. Those observations that had franchise codes were divided into categories based off the first 2 digits of the code. Some categories were deleted due to small sizes or an absence of a franchise code, finalizing a study population. One of each kind of random sample, simple, systematic, and stratified, was created with 5000 observations. The default rates of each sample were used in a comparative performance analysis to determine which was best to use for the analysis. The systematic random sample performed better than the simple and stratified random samples, leading me to believe the systematic random sample is the best type to use for the analysis of this data.

Data

I used a U.S. Small Business Administration (SBA) dataset for my analysis, which uses data between the years of 1987 and 2014. I used three different kinds of samples for my analysis. I used a simple random sample, a systematic random sample, and a stratified random sample, each using a total of 5000 observations of loans guaranteed to some degree by the SBA. Of the 27 variables in the dataset, I decided to use the FranchiseCode variable as my stratification variable for the stratified random sample. Most of the observations did not have a franchise code, which limited the population I could use, which I believe helped the study. Even though most observations didn’t have a franchise code, the number that did made it easier for me to find an appropriate sample size. I wanted to determine which of the three samples was best for this analysis.

loan01 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://pengdsci.github.io/datasets/w06-SBAnational09.csv", header = TRUE)[, -1]
bankLoan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)

Methods

After choosing FranchiseCode as my stratification variable, I will divide the codes into categories by their first two digits to make the data more manageable. I will then find the loan default rates by franchise code and delete categories that are unclassified or not large enough to be used via the inclusion rule. This will define the study population. After finding the population, I will run the three different types of samples listed above, each composed of 5000 observations. Using their default rates, I will run a comparative performance analysis on the three samples to determine which one is the best to use for this analysis.

Analysis

franchise =as.character(bankLoan$FranchiseCode) # make a character vector
franchise.2.digits = substr(bankLoan$FranchiseCode, 1, 2) # extract the first two digits of the NAICS code
bankLoan$FranchiseCode2Digit = franchise.2.digits  # add the above two-digit variable the loan data
FranchiseCode2Digit0 = bankLoan$FranchiseCode2Digit # extract the 2-digit NAICS
FranchiseCode2Digit =  bankLoan$FranchiseCode2Digit  # extract the 2-digit NAICS-copy
bankLoan$strFranchiseCode = FranchiseCode2Digit
del.categories = c("0", "1", "11", "12", "14", "27", "3", "86", "87", "92", "94", "95", "96", "97", "98", "99") # categories to be deleted in the original population
del.obs.status = !(bankLoan$strFranchiseCode %in% del.categories) # deletion status. ! negation operator
study.pop = bankLoan[del.obs.status,]  # excluding the categories
knitr::kable(t(table(study.pop$strFranchiseCode)))

10	13	15	16	17	18	19	20	21	22	23	24	25	26	28	29	30	31	32	33	34	35	36	37	38	39	40	41	42	43	44	45	46	47	48	49	50	51	52	53	54	55	56	57	58	59	60	61	62	63	64	65	66	67	68	69	70	71	72	73	74	75	76	77	78	79	80	81	82	83	84	85	88	89	90	91
2548	328	746	667	1571	320	434	779	1827	776	314	827	908	567	231	538	300	288	177	305	978	1098	512	244	849	546	231	408	388	480	328	694	370	236	537	407	1707	283	1374	700	411	476	682	246	203	218	359	710	336	389	679	927	736	1010	2688	273	577	257	489	735	377	795	276	493	3658	1247	632	858	268	338	745	191	312	626	607	297

I made FranchiseCode a stratification variable and divided the observation codes into categories by their first two digits. After checking the distributions of the 2-digit codes, I found the loan default rates by franchise code and deleted any categories that were unclassified or not large enough to be used via the inclusion rule. Doing this, I was able to define a study population to draw three different kinds of random samples from. Each sample was composed of 5000 loan observations.

Simple Random Sample

study.pop$sampling.frame = 1:length(study.pop$GrAppv) # sampling list
# checking the sampling list variable
sampled.list = sample(1:length(study.pop$GrAppv), 5000) # sampling the list
SRS.sample = study.pop[sampled.list,] # extract the sampling units (observations)
## dimension check
dimension.SRS = dim(SRS.sample)
names(dimension.SRS) = c("Size", "Var.count")
knitr::kable(t(dimension.SRS)) # checking the sample size

Size	Var.count
5000	30

For the simple random sample, I randomly chose a sample of 5000 loans guaranteed to some degree by the SBA. I defined a sampling list and then added it to the study population.

Systematic Random Sample

jump.size = dim(study.pop)[1]%/%5000 # find the jump size in the systematic sampling
rand.starting.pt=sample(1:jump.size,1) # find the random starting value
sampling.id = seq(rand.starting.pt, dim(study.pop)[1], jump.size)  # sampling IDs
sys.sample=study.pop[sampling.id,] # extract the sampling units of systematic samples
sys.Sample.dim = dim(sys.sample)
names(sys.Sample.dim) = c("Size", "Var.count")
knitr::kable(t(sys.Sample.dim))

Size	Var.count
5095	30

For the systematic random sample, I used a jump size to randomly chose a sample of 5000 loans guaranteed to some degree by the SBA at a regular interval.

Because the jump size involves rounding error and the population is large, the actual systematic sample size is slightly different from the target size. In this report, I used the integer part of the actual jump size. The actual systematic sampling size is slightly bigger than the target size. We can take away some records random from the systematic sample to make the size to be equal to the target size.

Stratified Random Sample

freq.table = table(study.pop$strFranchiseCode)  # frequency table of strFranchiseCode
rel.freq = freq.table/sum(freq.table)   # relative frequency 
strata.size = round(rel.freq*5000) # strata size allocation
strata.names=names(strata.size) # extract strFranchiseCode names for accuracy checking
strata.sample = study.pop[1,] # create a reference data frame
strata.sample$add.id = 1  # add a temporary ID to because in the loop
# i =2     # testing a single iteration
for (i in 1:length(strata.names)){
   ith.strata.names = strata.names[i] # extract data frame names
   ith.strata.size = strata.size[i] # allocated stratum size
   # The following code identifies observations to be selected
   ith.sampling.id = which(study.pop$strFranchiseCode==ith.strata.names) 
   ith.strata = study.pop[ith.sampling.id,]  # i-th stratified population
   ith.strata$add.id = 1:dim(ith.strata)[1]  # add sampling list/frame
   # The following code generates a subset of random ID
   ith.sampling.id = sample(1:dim(ith.strata)[1], ith.strata.size) 
   ## Create a selection status
   ith.sample =ith.strata[ith.strata$add.id %in%ith.sampling.id,]
   ## dim(ith.sample)         $ check the sample
   strata.sample = rbind(strata.sample, ith.sample)  # stack all data frame!
 }
 # dim(strata.sample)
 strat.sample.final = strata.sample[-1,] # drop the temporary stratum ID 
knitr::kable(t(strata.size))  # makes a table of stratified data

10	13	15	16	17	18	19	20	21	22	23	24	25	26	28	29	30	31	32	33	34	35	36	37	38	39	40	41	42	43	44	45	46	47	48	49	50	51	52	53	54	55	56	57	58	59	60	61	62	63	64	65	66	67	68	69	70	71	72	73	74	75	76	77	78	79	80	81	82	83	84	85	88	89	90	91
250	32	73	65	154	31	43	76	179	76	31	81	89	56	23	53	29	28	17	30	96	108	50	24	83	54	23	40	38	47	32	68	36	23	53	40	168	28	135	69	40	47	67	24	20	21	35	70	33	38	67	91	72	99	264	27	57	25	48	72	37	78	27	48	359	122	62	84	26	33	73	19	31	61	60	29

For the stratified sample, I used FranchiseCode as the stratification variable. Even though most observations didn’t have a franchise code, the number that did made it easier for me to find an appropriate sample size. A SRS was taken from each FranchiseCode stratum. I separated the different codes by their first two numbers and deleted the categories that were either too small or represented no franchise. Not many observations were able to be used since they didn’t have a franchise code. This made the population smaller and easier to sample from.

Results

Using the default rates of the three samples, I ran the comparative performance analysis on the three samples to determine which one is the best to use for this analysis.

x.table = table(bankLoan$strFranchiseCode, bankLoan$MIS_Status)
no.lab = x.table[,1]   # first column consists of unknown default label
default = x.table[,2]
no.default = x.table[,3]
default.rate = round(100*default/(default+no.default),1)
 default.status.rate = cbind(no.lab = no.lab, 
                          default = default, 
                          no.default = no.default,
                          default.rate=default.rate)
x.table = table(SRS.sample$strFranchiseCode, SRS.sample$MIS_Status)
no.lab.srs = x.table[,1] # first column consists of unknown default label
default.srs = x.table[,2]
no.default.srs = x.table[,3]
default.rate.srs = round(100*default.srs/(default.srs+no.default.srs),1)
##
franchise.code = names(default.rate.srs)    # extract NSICS code
default.rate.pop = default.rate[franchise.code]
# cbind(industry.code,industry.name)
SRS.pop.rates = cbind(default.rate.pop,default.rate.srs)
x.table = table(sys.sample$strFranchiseCode, sys.sample$MIS_Status)
no.lab.sys = x.table[,1]      # first column consists of unknown default label
default.sys = x.table[,2]
no.default.sys = x.table[,3]
default.rate.sys = round(100*default.sys/(default.sys+no.default.sys),1)
sys.SRS.pop.rates = cbind(default.rate.pop, default.rate.srs, default.rate.sys)
x.table = table(strat.sample.final$strFranchiseCode, strat.sample.final$MIS_Status)
no.lab.str = x.table[,1]      # first column consists of unknown default label
default.str = x.table[,2]
no.default.str = x.table[,3]
default.rate.str = round(100*default.str/(default.str+no.default.str),1)
str.SRS.pop.rates = cbind(default.rate.pop, default.rate.srs, default.rate.sys, default.rate.str)
knitr::kable(str.SRS.pop.rates, caption="Comparison of franchise-specific default rates between population, SRS, Systematic Sample, and Stratified Samples.")

Comparison of franchise-specific default rates between population, SRS, Systematic Sample, and Stratified Samples.
	default.rate.pop	default.rate.srs	default.rate.sys	default.rate.str
10	15.8	14.9	12.8	16.4
13	22.3	33.3	22.6	31.2
15	16.9	10.4	18.3	16.4
16	10.0	17.4	5.6	10.8
17	21.1	18.6	22.2	24.7
18	15.0	14.3	17.2	22.6
19	15.4	11.9	12.0	20.9
20	23.6	21.9	19.0	22.4
21	9.3	8.9	10.8	13.4
22	7.6	7.4	3.7	7.9
23	15.3	13.0	11.5	29.0
24	14.7	22.6	13.6	21.2
25	10.1	12.5	7.2	3.4
26	13.1	13.8	14.9	8.9
28	32.0	35.0	38.1	21.7
29	16.7	22.2	17.5	22.6
30	17.7	27.9	22.6	24.1
31	24.3	27.3	22.2	28.6
32	22.6	15.0	33.3	23.5
33	17.7	25.0	23.3	26.7
34	21.4	19.6	11.8	26.0
35	14.9	16.1	18.0	15.7
36	9.0	11.1	7.0	10.0
37	11.5	17.9	13.6	16.7
38	7.2	6.1	5.3	9.6
39	16.8	15.0	18.0	13.0
40	19.9	35.0	15.0	13.0
41	13.2	10.0	9.5	15.0
42	10.1	15.4	5.3	7.9
43	12.1	2.3	9.8	6.4
44	25.0	8.6	21.6	31.2
45	13.1	13.2	20.7	8.8
46	18.6	21.6	19.2	22.2
47	19.5	16.7	12.5	17.4
48	12.7	12.2	12.0	7.5
49	16.5	19.5	22.5	17.5
50	12.9	11.8	13.4	10.7
51	11.8	19.2	4.5	11.5
52	22.1	19.4	20.0	13.3
53	16.0	24.6	19.2	17.4
54	14.8	25.0	18.4	17.5
55	17.0	15.7	22.6	12.8
56	14.4	16.9	10.1	17.9
57	9.3	4.5	12.5	8.3
58	33.0	17.6	29.4	35.0
59	25.2	36.8	20.8	19.0
60	15.9	29.0	21.2	14.3
61	13.4	14.7	9.2	15.7
62	13.4	11.9	13.6	18.2
63	26.0	25.0	20.8	21.1
64	23.7	20.0	27.7	22.4
65	21.3	24.7	23.9	17.6
66	16.0	13.9	13.6	16.7
67	6.5	4.6	3.0	5.1
68	22.7	26.4	21.4	23.5
69	15.0	11.5	13.3	29.6
70	18.0	15.8	19.0	21.1
71	21.4	13.0	31.0	24.0
72	16.0	16.3	17.0	16.7
73	10.2	12.3	10.0	13.9
74	14.3	13.2	7.1	18.9
75	14.5	16.9	12.5	12.8
76	14.1	21.4	18.9	14.8
77	12.0	11.1	16.3	14.6
78	5.7	5.4	5.5	4.5
79	11.2	10.0	7.8	12.4
80	19.5	19.3	13.1	22.6
81	17.1	16.5	23.4	17.9
82	13.1	13.0	7.7	19.2
83	12.7	8.3	14.7	15.2
84	12.2	5.5	12.0	11.0
85	12.6	5.6	14.3	10.5
88	11.5	6.2	11.1	6.5
89	24.6	28.6	30.4	26.2
90	14.2	12.2	14.5	18.3
91	14.8	7.4	13.8	17.2

n=length(default.rate.pop)
plot(NULL, xlim=c(0,n), ylim=c(0, 50), 
     xlab="Franchise Code", 
     ylab ="Default Rates (Percentage)", axes=FALSE) # empty plot
# Light gray background
rect(par("usr")[1], par("usr")[3],
     par("usr")[2], par("usr")[4],
     col = "white")

# Add white grid
grid(nx = NULL, ny = NULL,
     col = "white", lwd = 1)
title("Comparison of Franchise-specific Default Rates Based on Random Samples")
points(1:n, as.vector(default.rate.pop), pch=16, col="gold3", cex = 0.8)
lines(1:n, as.vector(default.rate.pop),  lty=1, col="gold3", cex = 0.8)
#
points(1:n, as.vector(default.rate.srs), pch=17, col="darkorange1", cex = 0.8)
lines(1:n, as.vector(default.rate.srs), lty=2, col="darkorange1", cex = 0.8)
#
points(1:n, as.vector(default.rate.sys), pch=19, col="navyblue", cex = 0.8)
lines(1:n, as.vector(default.rate.sys), lty=3, col="navyblue", cex = 0.8)
#
points(1:n, as.vector(default.rate.str), pch=20, col="magenta", cex = 0.8)
lines(1:n, as.vector(default.rate.str), lty=4, col="magenta", cex = 0.8)
#
axis(1,at=1:n, label=franchise.code)
axis(2, las = 2)
#
clr = c("yellow","darkorange1","navyblue","magenta")
rowMax=apply(str.SRS.pop.rates, 1, max) # max default rate in each franchise
#segments(1:n, rep(0,n), 1:n, rowMax, lty=2, col="lightgray", lwd = 0.5)
legend(58, 50, c("Population", "Simple Random Sampling", "Systematic Sampling", "Stratified Sampling"), lty=rep(1,4), col=clr, pch=c(16,17,19,20), cex=0.6, bty="n")

After running a performance analysis on three different random samples, I found the systematic random sample to have performed better compared to the simple and stratified random samples. Therefore, I believe the systematic random sample is the best type of sample to use for the analysis of the SBA data.

Conclusion

Using the SBA dataset of loans and FranchiseCode as a stratification variable, I was able to determine which of the three samples, simple, systematic, and stratified, would have the best performance in a data analysis. A sample of 5000 observations was taken for each of the three samples. Examining the default rates of the three samples showed that the systematic random sample was best to use for this data analysis. Yet, many of the observations were not actually used in the analysis since they didn’t have a franchise code. I wonder, however, if the performance analysis of the three random samples would have turned out different if more of the loan observations in the dataset had actual franchise codes to use.

Appendix

x.table = table(bankLoan$strFranchiseCode, bankLoan$MIS_Status)
no.lab = x.table[,1]  # first column consists of unknown default label
default = x.table[,2]
no.default = x.table[,3]
default.rate = round(100*default/(default+no.default),1)
default.status.rate = cbind(no.lab = no.lab, 
                          default = default, 
                          no.default = no.default,
                          default.rate=default.rate)
knitr::kable(default.status.rate, caption = "Population size, default counts, and population default rates")

Population size, default counts, and population default rates
	no.lab	default	no.default	default.rate
0	795	71175	136865	34.2
1	1159	78523	558872	12.3
10	0	402	2146	15.8
11	0	5	50	9.1
12	0	3	12	20.0
13	0	73	255	22.3
14	0	11	80	12.1
15	1	126	619	16.9
16	0	67	600	10.0
17	0	332	1239	21.1
18	0	48	272	15.0
19	0	67	367	15.4
20	0	184	595	23.6
21	3	169	1655	9.3
22	0	59	717	7.6
23	0	48	266	15.3
24	3	121	703	14.7
25	0	92	816	10.1
26	0	74	493	13.1
27	0	21	70	23.1
28	0	74	157	32.0
29	0	90	448	16.7
3	0	3	9	25.0
30	1	53	246	17.7
31	0	70	218	24.3
32	0	40	137	22.6
33	0	54	251	17.7
34	0	209	769	21.4
35	0	164	934	14.9
36	2	46	464	9.0
37	0	28	216	11.5
38	0	61	788	7.2
39	0	92	454	16.8
40	0	46	185	19.9
41	0	54	354	13.2
42	3	39	346	10.1
43	0	58	422	12.1
44	0	82	246	25.0
45	0	91	603	13.1
46	0	69	301	18.6
47	0	46	190	19.5
48	1	68	468	12.7
49	0	67	340	16.5
50	0	221	1486	12.9
51	20	31	232	11.8
52	0	303	1071	22.1
53	0	112	588	16.0
54	0	61	350	14.8
55	0	81	395	17.0
56	1	98	583	14.4
57	0	23	223	9.3
58	0	67	136	33.0
59	0	55	163	25.2
60	0	57	302	15.9
61	0	95	615	13.4
62	0	45	291	13.4
63	0	101	288	26.0
64	1	161	517	23.7
65	1	197	729	21.3
66	0	118	618	16.0
67	0	66	944	6.5
68	1	609	2078	22.7
69	0	41	232	15.0
70	0	104	473	18.0
71	0	55	202	21.4
72	0	78	411	16.0
73	1	75	659	10.2
74	0	54	323	14.3
75	1	115	679	14.5
76	0	39	237	14.1
77	0	59	434	12.0
78	0	209	3449	5.7
79	2	140	1105	11.2
80	0	123	509	19.5
81	0	147	711	17.1
82	0	35	233	13.1
83	0	43	295	12.7
84	0	91	654	12.2
85	0	24	167	12.6
86	0	15	67	18.3
87	0	8	45	15.1
88	0	36	276	11.5
89	0	154	472	24.6
90	1	86	520	14.2
91	0	44	253	14.8
92	0	13	72	15.3
94	0	35	33	51.5
95	0	6	57	9.5
96	0	1	17	5.6
97	0	10	38	20.8
98	0	13	138	8.6
99	0	0	1	0.0

I made this table to calculate default rates across the franchises. I decided to use the population-level franchise-specific rates as a reference and compare them with the sample-level default rates. This table also includes the category of loans with no franchise code.

x.table = table(sys.sample$strFranchiseCode, sys.sample$MIS_Status)
no.lab.sys = x.table[,1]      # first column consists of unknown default label
default.sys = x.table[,2]
no.default.sys = x.table[,3]
default.rate.sys = round(100*default.sys/(default.sys+no.default.sys),1)
sys.SRS.pop.rates = cbind(default.rate.pop, default.rate.srs, default.rate.sys)
knitr::kable(sys.SRS.pop.rates, caption="Comparison of industry-specific default rates between population, SRS, and Systematic Sample.")

Comparison of industry-specific default rates between population, SRS, and Systematic Sample.
	default.rate.pop	default.rate.srs	default.rate.sys
10	15.8	14.9	12.8
13	22.3	33.3	22.6
15	16.9	10.4	18.3
16	10.0	17.4	5.6
17	21.1	18.6	22.2
18	15.0	14.3	17.2
19	15.4	11.9	12.0
20	23.6	21.9	19.0
21	9.3	8.9	10.8
22	7.6	7.4	3.7
23	15.3	13.0	11.5
24	14.7	22.6	13.6
25	10.1	12.5	7.2
26	13.1	13.8	14.9
28	32.0	35.0	38.1
29	16.7	22.2	17.5
30	17.7	27.9	22.6
31	24.3	27.3	22.2
32	22.6	15.0	33.3
33	17.7	25.0	23.3
34	21.4	19.6	11.8
35	14.9	16.1	18.0
36	9.0	11.1	7.0
37	11.5	17.9	13.6
38	7.2	6.1	5.3
39	16.8	15.0	18.0
40	19.9	35.0	15.0
41	13.2	10.0	9.5
42	10.1	15.4	5.3
43	12.1	2.3	9.8
44	25.0	8.6	21.6
45	13.1	13.2	20.7
46	18.6	21.6	19.2
47	19.5	16.7	12.5
48	12.7	12.2	12.0
49	16.5	19.5	22.5
50	12.9	11.8	13.4
51	11.8	19.2	4.5
52	22.1	19.4	20.0
53	16.0	24.6	19.2
54	14.8	25.0	18.4
55	17.0	15.7	22.6
56	14.4	16.9	10.1
57	9.3	4.5	12.5
58	33.0	17.6	29.4
59	25.2	36.8	20.8
60	15.9	29.0	21.2
61	13.4	14.7	9.2
62	13.4	11.9	13.6
63	26.0	25.0	20.8
64	23.7	20.0	27.7
65	21.3	24.7	23.9
66	16.0	13.9	13.6
67	6.5	4.6	3.0
68	22.7	26.4	21.4
69	15.0	11.5	13.3
70	18.0	15.8	19.0
71	21.4	13.0	31.0
72	16.0	16.3	17.0
73	10.2	12.3	10.0
74	14.3	13.2	7.1
75	14.5	16.9	12.5
76	14.1	21.4	18.9
77	12.0	11.1	16.3
78	5.7	5.4	5.5
79	11.2	10.0	7.8
80	19.5	19.3	13.1
81	17.1	16.5	23.4
82	13.1	13.0	7.7
83	12.7	8.3	14.7
84	12.2	5.5	12.0
85	12.6	5.6	14.3
88	11.5	6.2	11.1
89	24.6	28.6	30.4
90	14.2	12.2	14.5
91	14.8	7.4	13.8

Before I created the table comparing default rates for all the samples, I made a table excluding the stratified random sample to compare the other two samples first. It turned out that the systematic sample performed much better compared to the simple random sample.

SBA Bank Loan Data Analysis Report

Andrew Heneghan

10/16/2022