1 Introduction

The data set that will be used for this analysis consists of almost 900,000 observations, corresponding to loan applications submitted to banks by small businesses with a partial warranty from the Small Business Association (SBA). It is available to the public via free internet download. From this data set four different samples of about 4,000 observations each will be taken using four different sampling methods (SRS, systematic, stratified, & cluster), and the loan default rates for four distinct subpopulations based on geographic region will be calculated based on each sample’s observations. These sample default rates will then be compared to one another as well as to the regional default rates observed across the entire population.

2 Data Preprocessing

The data is split up into nine subsets which will be combined back into one master data set as the first step in the preprocessing stage.

library(dplyr)
library(ggplot2)

loan01 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)

As mentioned in the introduction, the initial population is 899,164 loan applications submitted by small businesses in the U.S. between 1987 and 2014 which were guaranteed to some degree by the SBA. First, any observation with a missing value for the loan status variable will be removed from the analytic data set.

loan <- subset(loan, !MIS_Status == "")
study.pop <- subset(loan, !Zip == "0")

Next, for sampling purposes later in the analysis, the data set will be stratified based on a variable indicating the U.S. geographic region (Northeast, Midwest, South, or West) of the borrower. This region variable will be created based on the existing variable which indicates the borrower’s state.

study.pop <- study.pop %>%
  mutate(Region = case_when(
    State %in% c("ME", "VT", "NH", "MA", "RI", "CT", "NY", "NJ", "PA") ~ "Northeast",
    State %in% c("WI", "MI", "IL", "IN", "OH", "ND", "SD", "NE", "KS", "MN", "IA", "MO") ~ "Midwest",
    State %in% c("DE", "MD", "DC", "VA", "WV", "NC", "SC", "GA", "FL", "KY", "TN", "AL", "MS", "AR", "LA", "OK", "TX") ~ "South",
    State %in% c("WA", "OR", "CA", "NV", "ID", "MT", "WY", "CO", "UT", "AZ", "NM", "AK", "HI") ~ "West",
    TRUE ~ "Other"
  ))

Of the 896,884 observations in the analytic data set at this point in the preprocessing stage, only 12 fell into the “Other” category of the new region variable. Because of the extreme scarcity of this category as well as the inability to meaningfully combine it with any of the other categories, these 12 observations will simply be removed from the analytic data set.

study.pop <- subset(study.pop, !Region == "Other")

Finally, a list of all the unique ZIP codes in the data will be generated, and the data set will be split into clusters based on these unique values of the variable. These clusters will be used for sampling later in the analysis.

unique.zipcodes <- unique(study.pop$Zip)
zip.clusters <- split(study.pop, study.pop$Zip)

With data preprocessing done, the study population can be defined as those observations from the original population which had some loan status listed and which fell into one of the four U.S. geographic regions defined by the newly created Region variable. There are 896,872 observations in this study population, from which each of the four samples will be taken.

3 Sampling & Calculating Regional Default Rates

3.1 Simple Random Sampling (SRS)

The first sampling method that will be used is simple random sampling. The chosen sample size is 4,000 observations. This will also serve as the target sample size for all of the samples.

set.seed(123)
study.pop$sampling.frame = 1:length(study.pop$GrAppv)   
sampled.list = sample(1:length(study.pop$GrAppv), 4000) 
SRS.sample = study.pop[sampled.list,]                  

With the sample random sample selected, the default rate for each of the four regional subpopulations can be calculated based on the selected observations.

region.default.SRS <- SRS.sample %>%
  group_by(Region) %>%
  summarize(default_rate = mean(MIS_Status == 'CHGOFF')*100)
kable(region.default.SRS)
Region default_rate
Midwest 17.54190
Northeast 16.60819
South 20.40073
West 16.92708

3.2 Systematic Sampling

The second sampling method which will be included in the comparison is systematic sampling. The jump size will be chosen so as to obtain as close to 4,000 observations as possible in the sample.

set.seed(123)
jump.size = dim(study.pop)[1]%/%4000  
rand.starting.pt=sample(1:jump.size,1)
sampling.id = seq(rand.starting.pt, dim(study.pop)[1], jump.size)
sys.sample=study.pop[sampling.id,]    
sys.Sample.dim = dim(sys.sample)
names(sys.Sample.dim) = c("Size", "Var.count")
kable(t(sys.Sample.dim))
Size Var.count
4004 29

As can be seen from the table, this systematic sample contains a total of 4,004 observations, slightly more than the target sample size. Four observations will be removed from the sample at random to arrive at our final systematic sample with exactly 4,000 observations.

set.seed(123)
extra.obs <- sample(nrow(sys.sample), 4)
sys.sample.final <- sys.sample[-extra.obs, ]

Again, the default rates for each region can be calculated, this time based on the observations in the systematic sample.

region.default.sys <- sys.sample.final %>%
  group_by(Region) %>%
  summarize(default_rate = mean(MIS_Status == 'CHGOFF')*100)
kable(region.default.sys)
Region default_rate
Midwest 16.89840
Northeast 16.16628
South 21.55412
West 18.24687

3.3 Stratified Sampling

The next sample will be generated using the stratified sampling method, with Region as the stratification variable. The amount of observations in each of the strata will be proportional to the relative frequencies of the observations in each region across the entire study population.

freq.table = table(study.pop$Region)  
rel.freq = freq.table/sum(freq.table)    
strata.size = round(rel.freq*4000)      
strata.names=names(strata.size)
set.seed(123)
strata.sample = study.pop[1,]
strata.sample$add.id = 1
for (i in 1:length(strata.names)){
   ith.strata.names = strata.names[i]
   ith.strata.size = strata.size[i]
   ith.sampling.id = which(study.pop$Region==ith.strata.names) 
   ith.strata = study.pop[ith.sampling.id,]
   ith.strata$add.id = 1:dim(ith.strata)[1] 
   ith.sampling.id = sample(1:dim(ith.strata)[1], ith.strata.size) 
   ith.sample =ith.strata[ith.strata$add.id %in%ith.sampling.id,]
   strata.sample = rbind(strata.sample, ith.sample)
 }
strat.sample.final = strata.sample[-1,]
strat.final.dim = dim(strat.sample.final)
names(strat.final.dim) = c("Size", "Var.count")
kable(t(strat.final.dim))
Size Var.count
3999 30

The final stratified sample contains a total of 3,999 observations. Because of its obvious proximity to the target sample size and the lack of a clear answer for how to randomly add an extra observation, this sample size will be accepted for the analysis.

Next, the regional default rates will be calculated based on the observations in this sample.

region.default.strat <- strat.sample.final %>%
  group_by(Region) %>%
  summarize(default_rate = mean(MIS_Status == 'CHGOFF')*100)

kable(region.default.strat)
Region default_rate
Midwest 14.53940
Northeast 16.66667
South 18.85246
West 16.72355

3.4 Cluster Sampling

The final sampling method that will be used is cluster sampling. For this method, a random sample will be taken from the list of unique ZIP codes generated in the data preprocessing phase, and the clusters of observations which correspond to the selected ZIP codes will be chosen and combined to create a new sample. Because of the significant variation in the sizes of the clusters, the total number of observations which comprise this new sample will almost certainly not be exactly 4,000. It was decided that 450 clusters would be chosen from the study population to comprise the cluster sample as this amount generally yields a sample size which is acceptably close to the target sample size.

set.seed(123)
sampled_clusters <- sample(unique.zipcodes, size = 450)
cluster.sample <- do.call(rbind, zip.clusters[sampled_clusters])
cluster.Sample.dim = dim(cluster.sample)
names(cluster.Sample.dim) = c("Size", "Var.count")
kable(t(cluster.Sample.dim))
Size Var.count
3971 28

This sampling method yielded a sample of 3,971 observations, which is reasonably close to the target sample size. Therefore this sample size will be accepted for the analysis.

Once again, the regional default rates in this sample are calculated.

region.default.cluster <- cluster.sample %>%
  group_by(Region) %>%
  summarize(default_rate = mean(MIS_Status == 'CHGOFF')*100)
kable(region.default.cluster)
Region default_rate
Midwest 14.87514
Northeast 18.14988
South 20.05420
West 15.84362

3.5 Study Population Default Rates

Finally, the default rates for each region based on the entire study population will be calculated for purposes of comparing with the results of the four sampling methods.

region.default.pop <- study.pop %>%
  group_by(Region) %>%
  summarize(default_rate = mean(MIS_Status == 'CHGOFF')*100)
kable(region.default.pop)
Region default_rate
Midwest 15.84033
Northeast 15.93425
South 20.52876
West 17.19110

4 Comparing Regional Default Rates By Sample

To compare the default rates determined from each sample to one another and to those of the entire study population, a combined bar chart will be created.

combined_df <- rbind(region.default.SRS, region.default.sys, region.default.strat, region.default.cluster, region.default.pop)
combined_df$Sample <- rep(c("SRS", "Systematic", "Stratified", "Cluster", "Population"), each = nrow(region.default.SRS))
combined_df$Sample <- factor(combined_df$Sample, levels = c("SRS", "Systematic", "Stratified", "Cluster", "Population"))

# Create the bar chart
combined_plot <- ggplot(combined_df, aes(x = Sample, y = default_rate, fill = Region)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Comparison of Regional Default Rates Across Samples/Population",
       x = "Sample",
       y = "Default Rate",
       fill = "Region") +
       ylim(0, 25)


combined_plot

It can be seen from this bar chart that the results from each of the four different sampling methods suggest that the Southern region of the U.S. exhibits the highest default rate, which aligns with the rates observed from the entire study population. Beyond that, however, the calculated default rates from each sample vary noticeably, both from one another as well as from the population rates. For instance, the simple random sample’s results suggest that the Midwest has the second highest default rate at about 17.5%, while looking only at the stratified sample would lead one to believe that it actually has the lowest default rate of all the regions, under 15%. In general, the calculated default rates for the Midwest vary significantly from sample to sample. Based on visual inspection of the chart, the estimates from the systematic sample appear to be the most similar to those of the entire population, although the systematic sample’s estimates for the South and West regions are noticeably higher than the rates observed in the population.

5 Conclusion

Again, because its estimates bear the closest resemblance to the population default rates, the systematic sample would be chosen as the analytic sample for further analysis based on the results of this process. That said, this is not necessarily an optimal sample and it may be advisable to try alternative sampling methods, sample sizes, stratification variables, etc. to generate a more representative sample.