1 Introduction

This data set, from the U.S. Small Business Administration (SBA), was collected from banks and covers small business loan applications. It contains 899164 observations, each with 27 variables. The response variable is MIS_Status, which indicates whether the loan was paid in full or defaulted on.

1.1 Variable Inspection

Next, we’re going to take a look at the variables we have, what they represent, and what type they are.

| Variable Name | Variable Type | Description |
|:--|:--|:--|
| LoanNr_ChkDgt | numeric | Identifier (ID) Variable |
| Name | character | Borrower Name |
| City | character | Borrower City |
| State | character | Borrower State |
| Zip | integer | Borrower Zip Code |
| Bank | character | Bank Name |
| BankState | character | Bank State |
| NAICS | integer | North American Industry Classification System code |
| ApprovalDate | character | Date SBA Commitment Issued |
| ApprovalFY | character | Fiscal Year of Commitment |
| Term | integer | Loan term in months |
| NoEmp | integer | Number of Employees |
| NewExist | integer | Business Existing (= 1) or New (= 2) |
| CreateJob | integer | Number of jobs created |
| RetainedJob | integer | Number of jobs retained |
| FranchiseCode | integer | Franchise Code; 0 or 1 = No Franchise, any other value = Franchise |
| UrbanRural | integer | Urban = 1, Rural = 2, Undefined = 0 |
| RevLineCr | character | Revolving Credit; Y = Yes, N = No |
| LowDoc | character | LowDoc Loan; Y = Yes, N = No |
| ChgOffDate | character | Date loan is declared defaulted |
| DisbursementDate | character | Disbursement Date |
| DisbursementGross | character | Disbursement Amount |
| BalanceGross | character | Gross amount outstanding |
| MIS_Status | character | Loan Status; Paid off or Default |
| ChgOffPrinGr | character | Charged off Amount |
| GrAppv | character | Gross Approved Amount |
| SBA_Appv | character | SBA’s Guaranteed Amount |

We have a wide variety of explanatory variables for each observation: socioeconomic variables about the borrower, geographic variables that identify region, personal identifiers, loan terms, and key information about the small business.

2 Data Preprocessing/Sampling Preparation

Even a cursory look at the table shows that we must preprocess the data and reformat certain variables.

2.1 Remove Missing Values

The first step is to remove observations where the MIS_Status value is missing. This is important because MIS_Status is our response variable, so we can’t analyze observations that don’t have an outcome or response.

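library(knitr)  # provides kable()

# Recode blank MIS_Status strings as NA, then count the missing values
loan$MIS_Status[loan$MIS_Status == ""] <- NA
sumdata <- sum(is.na(loan$MIS_Status))

kable(sumdata, format = "markdown", col.names = c("Total Number of Missing MIS_Status Values"))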
Total Number of Missing MIS_Status Values
1997

Blank MIS_Status values must be removed. We first converted these values to NA and then counted them, which gives 1997 observations without a MIS_Status value. We then removed each of these observations entirely and double-checked that no missing values remained.

Final Number of Missing MIS_Status Values
0
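
The removal and the follow-up check described above can be sketched as follows. This is a minimal illustration on a toy data frame, not the original code: the name loan_clean and the status labels "paid" and "default" are assumptions made for the example.

```r
# Toy stand-in for the loan data (the real data frame has 27 columns).
# The status labels below are illustrative, not the actual coded values.
loan <- data.frame(
  LoanNr_ChkDgt = 1:5,
  MIS_Status = c("paid", "", "default", "", "paid"),
  stringsAsFactors = FALSE
)

# Recode blank strings as NA and count them.
loan$MIS_Status[loan$MIS_Status == ""] <- NA
n_missing_before <- sum(is.na(loan$MIS_Status))

# Drop every observation (row) that has no response value.
# loan_clean is a hypothetical name for the cleaned data frame.
loan_clean <- loan[!is.na(loan$MIS_Status), ]

# Double-check that no missing responses remain.
n_missing_after <- sum(is.na(loan_clean$MIS_Status))
```

Dropping whole rows, rather than only blanking the response, keeps every remaining observation usable for any later analysis of MIS_Status.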

2.2 Discretizing States by Geographic Regions

Originally, the State variable has 51 unique values: the 2-letter abbreviations of the 50 states and the District of Columbia. We omitted any observations that did not have a 2-letter abbreviation. We then used the following map to classify each state by its geographic region.

Figure: US Geographic Regions

We converted each state to its corresponding region; DC was assigned to Mid-Atlantic.
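
One way to carry out this conversion is a named lookup vector, sketched below on toy data. Only the DC, Alaska, and Hawaii assignments come from the text; the CA entry is an assumption added for illustration, and a full lookup would need one entry per state taken from the map.

```r
# Partial state-to-region lookup (a full version needs all 51 codes).
region_lookup <- c(
  DC = "Mid-Atlantic",    # stated in the text
  AK = "Non-Contiguous",  # stated in the text
  HI = "Non-Contiguous",  # stated in the text
  CA = "West Coast"       # assumed for illustration only
)

# Toy State column, including two invalid entries ("" and "C").
loan <- data.frame(
  State = c("DC", "AK", "CA", "HI", "", "C"),
  stringsAsFactors = FALSE
)

# Omit observations whose State is not a 2-letter abbreviation.
loan <- loan[grepl("^[A-Z]{2}$", loan$State), , drop = FALSE]

# Map each remaining state to its region and tabulate the result.
loan$Region <- unname(region_lookup[loan$State])
region_counts <- table(loan$Region)
```

Any state code missing from the lookup would come back as NA in Region, which makes gaps in the mapping easy to spot.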

| Region | Count |
|:--|--:|
| Great Plains | 37439 |
| Mid-Atlantic | 133249 |
| Midwest | 174646 |
| New England | 69547 |
| Non-Contiguous | 6009 |
| Rocky Mountain | 60427 |
| South | 148868 |
| Southwest | 102059 |
| West Coast | 164771 |

The counts give a better idea of how the loan data is distributed across US geographic regions. One potential problem is that the Non-Contiguous category, comprising Alaska and Hawaii, has only 6009 observations. While this category makes up less than 1% of the total data (6009 of 897015 observations, or about 0.7%), it has practical significance for comparing Alaska and Hawaii with the 48 contiguous states. As a result, we leave it in the data despite potential concerns about sparsity.

2.3 Clustering Zip Code

We chose to cluster the Zip Code variable because zip codes that are geographically close often share similar demographics and socioeconomic status. This makes them a natural basis for clustering, since we want the loan observations within each cluster to be as similar to each other as possible.

The clustering method groups Zip Codes that are numerically similar (i.e., whose first 3 digits are identical) into one cluster.
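
The prefix grouping can be sketched as below. Because Zip is stored as an integer, leading zeros are lost (01001 becomes 1001), so this sketch pads each code back to five digits before taking the prefix. The padding step and the names zip5 and zip_cluster are our assumptions, not part of the original analysis.

```r
# Toy integer zip codes; the first two have lost their leading zero.
zip <- c(1001L, 1002L, 90210L, 90211L, 60601L)

# Restore the 5-digit form, then keep the first 3 digits as the cluster label.
zip5 <- sprintf("%05d", zip)
zip_cluster <- substr(zip5, 1, 3)
```

Without the padding, 1001 and 10010 would share the prefix "100" even though they belong to different areas.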

3 Sampling

We will perform 4 types of sampling on the loan data: Simple Random Sample, Systematic Sample, Stratified Sample, and Cluster Sample using Zip Code.

3.1 Simple Random Sample

In simple random sampling, every member of the population has an equal chance of being selected. We set a seed to make our results reproducible and used the sample function to randomly select 1300 observations from the new_loan data set. This type of sampling is beneficial because every observation has the same probability of selection, which guards against selection bias and makes the sample representative of the population on average.
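
A minimal sketch of this step, using a toy stand-in for new_loan. The seed value, the toy columns, and the names srs_idx and srs_sample are assumptions for illustration.

```r
set.seed(123)  # assumed seed; any fixed value makes the draw reproducible

# Toy stand-in for new_loan.
new_loan <- data.frame(id = 1:5000, amount = runif(5000))

n <- 1300  # sample size used in the text

# sample() draws without replacement by default, so no row repeats.
srs_idx <- sample(nrow(new_loan), size = n)
srs_sample <- new_loan[srs_idx, ]
```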

3.2 Systematic Random Sample

Systematic sampling involves selecting members from a larger population according to a random starting point and a fixed, periodic interval. This technique spreads the sample evenly across the population and is often simpler to carry out than simple random sampling. It’s particularly useful when a complete list of all members of the population is available. In this case, we calculate the interval k by dividing the population size by the desired sample size, choose a random start within the first interval, and then select every kth observation thereafter.
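
The interval, random start, and selection can be sketched as follows. The population size here is a toy value, and rounding the interval down with floor() is our assumption about how a non-integer ratio is handled.

```r
set.seed(123)  # assumed seed for reproducibility

N <- 5000  # toy population size
n <- 1300  # desired sample size

# Interval: population size divided by sample size, rounded down.
k <- floor(N / n)

# Random start within the first interval, then every kth row after it.
start <- sample(k, 1)
sys_idx <- seq(from = start, by = k, length.out = n)
```

Rounding down guarantees that the last selected index, start + (n - 1) * k, never exceeds N.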

3.3 Stratified Sample

Stratified sampling is a method where the population is divided into homogeneous subgroups, known as strata, and random samples are taken from each stratum. The strata_sizes vector is used to ensure that the sample size for each stratum is proportional to the stratum’s size within the population. This method can provide greater precision than simple random sampling by ensuring that specific subgroups are adequately represented in the sample. This can be particularly important if we expect that the measurements could differ by subgroup.
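
Proportional allocation can be sketched as below. Using Region as the stratification variable is an assumption for this example, as are the toy counts; only the name strata_sizes comes from the text.

```r
set.seed(123)  # assumed seed for reproducibility

# Toy population with three strata of very different sizes.
new_loan <- data.frame(
  id = 1:1000,
  Region = rep(c("South", "Midwest", "Non-Contiguous"), times = c(600, 390, 10)),
  stringsAsFactors = FALSE
)
n <- 100  # toy overall sample size

# Sample size per stratum, proportional to the stratum's population share.
# Rounding can make the total differ slightly from n in general.
tab <- table(new_loan$Region)
strata_counts <- setNames(as.vector(tab), names(tab))
strata_sizes <- round(n * strata_counts / nrow(new_loan))

# Draw a simple random sample of the allotted size within each stratum.
rows_by_stratum <- split(seq_len(nrow(new_loan)), new_loan$Region)
strat_idx <- unlist(lapply(names(rows_by_stratum), function(s) {
  rows <- rows_by_stratum[[s]]
  rows[sample.int(length(rows), strata_sizes[[s]])]
}))
strat_sample <- new_loan[strat_idx, ]
```

In this toy case the small Non-Contiguous stratum still contributes one observation, whereas a simple random sample of the same size could miss it entirely.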

3.4 Cluster Sample

Cluster sampling is a technique where the population is divided into separate groups, or clusters. A random sample of these clusters is then selected for analysis. In our case, we define each unique ZIP code as a cluster and then randomly select a number of these clusters. This method is advantageous when it is costly or impractical to conduct a census of the entire population. It is particularly useful when the population is spread out geographically and individual elements are not conveniently accessible.
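
A one-stage version of this procedure can be sketched as follows: pick clusters at random, then keep every observation inside them. The toy data, the number of clusters, and the names chosen_zips and cluster_sample are assumptions for illustration.

```r
set.seed(123)  # assumed seed for reproducibility

# Toy stand-in for new_loan: four ZIP codes with three loans each.
new_loan <- data.frame(
  id = 1:12,
  Zip = rep(c(10001L, 60601L, 90210L, 73301L), each = 3)
)

n_clusters <- 2  # hypothetical number of clusters to draw

# Randomly select whole clusters, then keep all rows in those clusters.
chosen_zips <- sample(unique(new_loan$Zip), size = n_clusters)
cluster_sample <- new_loan[new_loan$Zip %in% chosen_zips, ]
```

Because clusters differ in size on real data, the number of sampled observations is not fixed in advance; only the number of clusters is.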

4 Summary

In this study, we used the SBA loan dataset to compare different sampling techniques and their applicability to real-world data analysis scenarios. We began by preprocessing the dataset, which involved handling missing values, reclassifying the state variable into broader geographic regions, and clustering Zip Codes. We then applied four distinct sampling strategies (simple random, systematic, stratified, and cluster sampling), each chosen for its particular benefits and fit with our data. Simple random sampling gives every observation an equal chance of selection, while systematic sampling spreads the selection evenly across the data. Stratified sampling improves precision by accounting for subgroup proportions, and cluster sampling with ZIP codes takes advantage of natural geographic divisions. Together, these methods aim to yield samples that reflect the larger population of small business loan applications.