1 Introduction

This data set was collected from banks on small business loan applications from the Small Business Administration (SBA). It contains 899,164 observations, each with 27 variables. The response variable is MIS_Status, which indicates whether the loan was paid in full or defaulted on.

1.1 Variable Inspection

Next, we look at the variables in the data set: what each one represents and what type it is.

Variable Name	Variable Type	Description
LoanNr_ChkDgt	numeric	Identifier (ID) Variable
Name	character	Borrower Name
City	character	Borrower City
State	character	Borrower State
Zip	integer	Borrower Zip Code
Bank	character	Bank Name
BankState	character	Bank State
NAICS	integer	North American Industry Classification System code
ApprovalDate	character	Date SBA Commitment Issued
ApprovalFY	character	Fiscal Year of Commitment
Term	integer	Loan term in months
NoEmp	integer	Number of Employees
NewExist	integer	Business Existing (= 1) or New (= 2)
CreateJob	integer	Number of jobs created
RetainedJob	integer	Number of jobs retained
FranchiseCode	integer	Franchise Code; 0 or 1 = No Franchise, any other value = Franchise
UrbanRural	integer	Urban = 1, Rural = 2, Undefined = 0
RevLineCr	character	Revolving Credit; Y = Yes, N = No
LowDoc	character	LowDoc Loan; Y = Yes, N = No
ChgOffDate	character	Date loan is declared defaulted
DisbursementDate	character	Disbursement Date
DisbursementGross	character	Disbursement Amount
BalanceGross	character	Gross amount outstanding
MIS_Status	character	Loan Status; Paid off or Default
ChgOffPrinGr	character	Charged off Amount
GrAppv	character	Gross Approved Amount
SBA_Appv	character	SBA’s Guaranteed Amount

We have a wide variety of explanatory variables for each observation. These include socioeconomic variables about the borrower, geographic variables that describe the region, personal identifiers, loan terms, and key information about the small business.

2 Data Preprocessing/Sampling Preparation

Even a cursory look at the table shows that we must preprocess the data and reformat certain variables.

2.1 Remove Missing Values

The first step is to remove observations where the MIS_Status value is missing. This is important because MIS_Status is our response variable, so we cannot analyze observations that have no outcome.

library(knitr)  # provides kable()

# Recode blank MIS_Status entries as NA, then count them
loan$MIS_Status[loan$MIS_Status == ""] <- NA
sumdata <- sum(is.na(loan$MIS_Status))

kable(sumdata, format = "markdown", col.names = c("Total Number of Missing MIS_Status Values"))

Total Number of Missing MIS_Status Values
1997

Blank MIS_Status values must be removed. We first converted them to NA and then counted them: 1997 observations have no MIS_Status value. We then removed those observations entirely and double-checked that no missing values remain.

Final Number of Missing MIS_Status Values
0
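The removal and re-check described above could be sketched as follows. This is a minimal illustration on a toy data frame, not the report's actual code: the status labels, the six example rows, and the name loan_clean are all assumptions made for the example.

```r
# Toy stand-in for the loan data (the real data set has 27 columns).
# The status labels "PIF" and "CHGOFF" are placeholders for illustration.
loan <- data.frame(
  LoanNr_ChkDgt = 1:6,
  MIS_Status = c("PIF", "", "CHGOFF", "PIF", "", "CHGOFF"),
  stringsAsFactors = FALSE
)

# Recode blank statuses as NA and count them
loan$MIS_Status[loan$MIS_Status == ""] <- NA
n_missing_before <- sum(is.na(loan$MIS_Status))

# Drop every observation without a response value
# (loan_clean is a hypothetical name for the filtered data)
loan_clean <- loan[!is.na(loan$MIS_Status), ]

# Double-check that no missing responses remain
n_missing_after <- sum(is.na(loan_clean$MIS_Status))
```

Filtering rows with a logical index keeps all other columns intact, so the remaining observations can be used unchanged in later steps.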

2.2 Discretizing States by Geographic Regions

Originally, the State variable has 51 unique values: the 50 states and the District of Columbia, each recorded as a two-letter abbreviation. We omitted any observations that did not have a two-letter abbreviation. We then used the following map to classify each state by its geographic region.

Figure: US Geographic Regions (map)

We converted each state to its corresponding region; DC was assigned to Mid-Atlantic.

Region	Count
Great Plains	37439
Mid-Atlantic	133249
Midwest	174646
New England	69547
Non-Contiguous	6009
Rocky Mountain	60427
South	148868
Southwest	102059
West Coast	164771
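One way to carry out this kind of recoding is a named lookup vector, sketched below. Only the assignments for DC, AK, and HI come from the text; the other entries, the example state vector, and the name region_lookup are assumptions for illustration, since the full state-to-region map is not reproduced here.

```r
# Partial, illustrative lookup. DC, AK and HI follow the text;
# the remaining assignments are assumed for the sake of the example.
region_lookup <- c(
  DC = "Mid-Atlantic",
  AK = "Non-Contiguous",
  HI = "Non-Contiguous",
  CA = "West Coast",   # assumed
  TX = "Southwest",    # assumed
  MA = "New England"   # assumed
)

# Example State values, including one blank entry that should be dropped
states <- c("CA", "DC", "HI", "TX", "AK", "MA", "CA", "")

# Keep only two-letter codes that appear in the lookup
valid <- nchar(states) == 2 & states %in% names(region_lookup)
states <- states[valid]

# Map each state to its region and tabulate the result
region <- unname(region_lookup[states])
region_counts <- table(region)

# Share of each region in percent, useful for spotting sparse categories
region_share <- round(100 * prop.table(region_counts), 1)
```

Inspecting region_share makes it easy to see when a category, such as Non-Contiguous in the real data, accounts for only a small fraction of the observations.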

Examining the counts gives a better idea of how the loan data is distributed across US geographic regions. One potential problem is that the Non-Contiguous category, which comprises Alaska and Hawaii, has only 6,009 observations. Although this category makes up less than 1% of the data, it has practical value for comparing Alaska and Hawaii with the 48 contiguous states. We therefore keep it in the data despite the potential sparsity.

2.3 Clustering Zip Code

We chose to cluster the Zip Code variable because areas with nearby zip codes often have similar demographics and socioeconomic status. This makes it a good fit, since we want the loan observations within each cluster to be as similar to each other as possible.

We clustered Zip Codes that were numerically similar, i.e. those whose first 3 digits were identical, so that all Zip Codes sharing a 3-digit prefix form one cluster.
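The prefix grouping can be sketched as below. Because Zip is stored as an integer, leading zeros are lost (for example, 02138 becomes 2138); padding back to five digits before taking the prefix is an assumption about how this might be handled, not a step the text confirms. The example values and the name zip_cluster are hypothetical.

```r
# Example Zip values stored as integers, so leading zeros are missing
zip <- c(2138L, 2139L, 90210L, 90211L, 60614L, 501L)

# Pad to five digits so that the prefix is taken from the true Zip Code
zip5 <- sprintf("%05d", zip)

# The first three digits define the cluster label
zip_cluster <- substr(zip5, 1, 3)

# Number of observations per cluster
cluster_counts <- table(zip_cluster)
```

Without the padding step, 2138 and 21380 would both appear to start with "213" even though they belong to different areas.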

3 Sampling

We will perform 4 types of sampling on the loan data: Simple Random Sample, Systematic Sample, Stratified Sample, and Cluster Sample using Zip Code.

3.1 Simple Random Sample

In simple random sampling, every member of the population has an equal chance of being selected. We set a seed to make our results reproducible and used the sample function to randomly select 1300 observations from the new_loan data set. This type of sampling is beneficial because every observation has the same probability of selection, which avoids selection bias and makes the sample representative of the population on average.
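A minimal sketch of this step is shown below. The seed value, the toy stand-in for new_loan, and its columns are assumptions for illustration; only the sample size of 1300 and the use of sample come from the text.

```r
set.seed(123)  # arbitrary seed, assumed for reproducibility

# Toy stand-in for the preprocessed data set
new_loan <- data.frame(
  id = 1:10000,
  amount = round(runif(10000, min = 1000, max = 500000))
)

n <- 1300

# Draw n distinct row indices; sample() draws without replacement by default
srs_idx <- sample(nrow(new_loan), size = n)
srs_sample <- new_loan[srs_idx, ]
```

Sampling row indices rather than values keeps each selected observation's columns together.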

3.2 Systematic Random Sample

Systematic sampling involves selecting members from a larger population according to a random starting point and a fixed, periodic interval. This technique spreads the sample evenly across the population list and is often simpler to carry out than simple random sampling. It is particularly useful when a complete list of all members of the population is available. In this case, we calculate the interval k by dividing the population size by the desired sample size, choose a random start within the first interval, and then select every k-th observation thereafter.
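The interval calculation and selection can be sketched as follows. The population size, the seed, and the toy new_loan data frame are assumptions for the example, and rounding the interval down with floor is one common choice rather than something the text specifies.

```r
set.seed(123)  # arbitrary seed, assumed for reproducibility

N <- 10000  # toy population size (assumption)
n <- 1300   # desired sample size

new_loan <- data.frame(id = seq_len(N))

# Sampling interval: population size divided by sample size, rounded down
k <- floor(N / n)

# Random starting point within the first interval (1..k)
start <- sample(k, 1)

# Every k-th row from the starting point, n rows in total
sys_idx <- seq(from = start, by = k, length.out = n)
sys_sample <- new_loan[sys_idx, , drop = FALSE]
```

When N / n is not a whole number, rounding k down means the last few rows of the list can never be selected, so the order of the list matters for how even the coverage really is.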

3.3 Stratified Sample

Stratified sampling is a method where the population is divided into homogeneous subgroups, known as strata, and random samples are taken from each stratum. The strata_sizes vector is used to ensure that the sample size for each stratum is proportional to the stratum’s size within the population. This method can provide greater precision than simple random sampling by ensuring that specific subgroups are adequately represented in the sample. This can be particularly important if we expect that the measurements could differ by subgroup.
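Proportional allocation of this kind can be sketched as below. The text does not say which variable defines the strata, so using Region is an assumption, as are the toy data, the seed, and the three example regions. Only the name strata_sizes and the idea of proportional sizes come from the text.

```r
set.seed(123)  # arbitrary seed, assumed for reproducibility

# Toy stand-in with an assumed stratification variable
new_loan <- data.frame(
  id = 1:10000,
  Region = sample(c("Midwest", "South", "West Coast"), 10000,
                  replace = TRUE, prob = c(0.5, 0.3, 0.2)),
  stringsAsFactors = FALSE
)

n <- 1300

# Row indices of each stratum and the stratum sizes in the population
rows_by_stratum <- split(seq_len(nrow(new_loan)), new_loan$Region)
stratum_counts <- vapply(rows_by_stratum, length, integer(1))

# Sample size per stratum, proportional to its share of the population.
# Rounding means the total may differ from n by a few observations.
strata_sizes <- round(n * stratum_counts / nrow(new_loan))

# Simple random sample within each stratum
strat_idx <- unlist(
  Map(function(rows, size) sample(rows, size), rows_by_stratum, strata_sizes),
  use.names = FALSE
)
strat_sample <- new_loan[strat_idx, ]
```

Because each stratum contributes in proportion to its size, the sample's regional mix matches the population's up to rounding.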

3.4 Cluster Sample

Cluster sampling is a technique where the population is divided into separate groups, or clusters. A random sample of these clusters is then selected for analysis. In our case, we treat each ZIP code group from Section 2.3 as a cluster and then randomly select a number of these clusters. This method is advantageous when it is costly or impractical to conduct a census of the entire population. It is particularly useful when the population is spread out geographically and individual elements are not conveniently accessible.
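A one-stage version of this procedure is sketched below: clusters are drawn at random and every observation in a chosen cluster is kept. The number of clusters drawn, the seed, the toy data, and the column name zip_cluster are assumptions for illustration, and the text does not say whether the original analysis subsampled within clusters.

```r
set.seed(123)  # arbitrary seed, assumed for reproducibility

# Toy stand-in: each row carries a three-digit cluster label
new_loan <- data.frame(
  id = 1:10000,
  zip_cluster = sprintf("%03d", sample(0:199, 10000, replace = TRUE)),
  stringsAsFactors = FALSE
)

n_clusters <- 20  # number of clusters to draw (assumption)

# Randomly choose whole clusters
clusters <- unique(new_loan$zip_cluster)
chosen <- sample(clusters, size = n_clusters)

# Keep every observation that belongs to a chosen cluster
cluster_sample <- new_loan[new_loan$zip_cluster %in% chosen, ]
```

Unlike the other three designs, the resulting sample size is not fixed in advance, because it depends on how many observations fall in the chosen clusters.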

4 Summary

In this study, we used the SBA loan dataset to compare different sampling techniques and their applicability to real-world data analysis. We began by preprocessing the dataset, which involved removing observations with a missing response, reclassifying the state variable into broader geographic regions, and grouping ZIP codes by their first three digits. We then applied four sampling strategies (simple random, systematic, stratified, and cluster sampling), each chosen for its particular benefits and fit with our data. Simple random sampling allowed for unbiased representation, while systematic sampling provided an evenly distributed selection. Stratified sampling improved precision by accounting for subgroup proportions, and cluster sampling with ZIP codes took advantage of natural geographic divisions. Together, these methods aim to yield samples that reflect the larger population of small business loan applications.

Comparing Sampling Plans through SBA loan data

Joshua Zhong

03/10/2024