This data set, from the U.S. Small Business Administration (SBA), contains small business loan applications collected from banks. It contains 899164 observations, each with 27 variables. The response variable is MIS_Status, which indicates whether the loan was paid in full or defaulted on.
Next, we’re going to take a look at the variables we have, what they represent, and what type they are.
| Variable Name | Variable Type | Description |
|---|---|---|
| LoanNr_ChkDgt | numeric | Identifier (ID) Variable |
| Name | character | Borrower Name |
| City | character | Borrower City |
| State | character | Borrower State |
| Zip | integer | Borrower Zip Code |
| Bank | character | Bank Name |
| BankState | character | Bank State |
| NAICS | integer | North American Industry Classification System code |
| ApprovalDate | character | Date SBA Commitment Issued |
| ApprovalFY | character | Fiscal Year of Commitment |
| Term | integer | Loan term in months |
| NoEmp | integer | Number of Employees |
| NewExist | integer | Business Existing (= 1) or New (= 2) |
| CreateJob | integer | Number of jobs created |
| RetainedJob | integer | Number of jobs retained |
| FranchiseCode | integer | Franchise Code; 0 or 1 = No Franchise, any other value identifies the franchise |
| UrbanRural | integer | Urban = 1, Rural = 2, Undefined = 0 |
| RevLineCr | character | Revolving Credit; Y = Yes, N = No |
| LowDoc | character | LowDoc Loan; Y = Yes, N = No |
| ChgOffDate | character | Date loan is declared defaulted |
| DisbursementDate | character | Disbursement Date |
| DisbursementGross | character | Disbursement Amount |
| BalanceGross | character | Gross amount outstanding |
| MIS_Status | character | Loan Status; Paid off or Default |
| ChgOffPrinGr | character | Charged off Amount |
| GrAppv | character | Gross Approved Amount |
| SBA_Appv | character | SBA’s Guaranteed Amount |
We have a wide variety of explanatory variables for each of our observations: socioeconomic variables about the borrower, geographic variables that describe the region, personal identifiers, loan terms, and important information about the small business.
Even a cursory look at the table shows that we must preprocess the data and reformat certain variables. For example, the dollar amounts and dates are stored as character strings.
The first step is to remove observations where the MIS_Status value is missing. This is important because MIS_Status is our response variable, so we can’t analyze observations that don’t have an outcome.
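```r
# Recode blank MIS_Status strings as NA so they can be counted as missing
loan$MIS_Status[loan$MIS_Status == ""] <- NA
# Count the observations with a missing response
sumdata <- sum(is.na(loan$MIS_Status))
kable(sumdata, format = "markdown", col.names = c("Total Number of Missing MIS_Status Values"))
```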
| Total Number of Missing MIS_Status Values |
|---|
| 1997 |
Observations with a blank MIS_Status must be removed. We first converted the blank values to NA and then counted them, which gives 1997 observations without a MIS_Status value. We then removed each of these observations entirely and double-checked that no missing values remain.
| Final Number of Missing MIS_Status Values |
|---|
| 0 |
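The removal and the follow-up check described above might look like the following minimal sketch. The toy data frame and its status labels are illustrative only, and the name `new_loan` for the cleaned data is an assumption borrowed from the sampling section; the original code for this step is not shown.

```r
# Toy stand-in for the loan data (illustrative values, not the real data set)
loan <- data.frame(
  LoanNr_ChkDgt = 1:5,
  MIS_Status = c("P I F", "", "CHGOFF", "", "P I F"),
  stringsAsFactors = FALSE
)

# Recode blank strings as NA, as in the step above
loan$MIS_Status[loan$MIS_Status == ""] <- NA

# Drop every observation whose response is missing
# (`new_loan` is an assumed name for the cleaned data)
new_loan <- loan[!is.na(loan$MIS_Status), ]

# Verify that no missing responses remain
remaining_missing <- sum(is.na(new_loan$MIS_Status))
print(remaining_missing)
```

Filtering on `!is.na(...)` removes the whole row, so every other variable for that loan is dropped along with the missing response.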
Originally, the State variable should take 51 unique values: the two-letter abbreviations of the 50 states and the District of Columbia. We omitted any observations that did not have a two-letter abbreviation. We then used the following map to classify each state by its geographic region.
We converted each state to its corresponding region; DC was assigned to Mid-Atlantic.
| Region | Count |
|---|---|
| Great Plains | 37439 |
| Mid-Atlantic | 133249 |
| Midwest | 174646 |
| New England | 69547 |
| Non-Contiguous | 6009 |
| Rocky Mountain | 60427 |
| South | 148868 |
| Southwest | 102059 |
| West Coast | 164771 |
The counts give us a better idea of how our loan data is distributed across US geographic regions. One potential problem is that the Non-Contiguous category, which comprises Alaska and Hawaii, has only 6009 observations. Although this category makes up less than 1% of the total data (about 0.7%), it has practical significance for seeing how Alaska and Hawaii compare to the 48 contiguous states. As a result, we will leave it in the data despite potential concerns about sparsity.
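The state-to-region recoding can be sketched with a named lookup vector. Only the assignments of AK, HI, and DC come from the text; the other entries in `region_lookup` are assumptions made for illustration, since the full mapping lives in the map and is not listed here. The toy `loan` data frame is also hypothetical.

```r
# Partial lookup from state abbreviation to region
region_lookup <- c(
  AK = "Non-Contiguous",  # stated in the text
  HI = "Non-Contiguous",  # stated in the text
  DC = "Mid-Atlantic",    # stated in the text
  NY = "Mid-Atlantic",    # assumption for illustration
  TX = "Southwest",       # assumption for illustration
  CA = "West Coast"       # assumption for illustration
)

# Toy data, including one blank State that should be dropped
loan <- data.frame(
  State = c("CA", "AK", "DC", "TX", "", "HI", "NY"),
  stringsAsFactors = FALSE
)

# Keep only observations whose State is a two-letter abbreviation
loan <- loan[grepl("^[A-Z]{2}$", loan$State), , drop = FALSE]

# Recode each state to its region; unname() drops the state labels
loan$Region <- unname(region_lookup[loan$State])

# Tabulate observations per region
region_counts <- table(loan$Region)
print(region_counts)
```

A state missing from the lookup would come back as NA, which makes gaps in the mapping easy to spot before tabulating.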
We chose to cluster on the Zip Code variable because Zip Codes that are geographically close often share similar demographics and socioeconomic status. This makes Zip Code a good basis for clustering, since we want the loan observations within each cluster to be as similar to each other as possible.
We formed clusters by grouping Zip Codes that are numerically similar: all Zip Codes whose first 3 digits are identical were placed in the same cluster.
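A minimal sketch of this grouping is shown below. Because Zip is stored as an integer, leading zeros are lost (02139 becomes 2139), so the sketch pads each value back to five digits before taking the prefix. The padding step and the column name `ZipCluster` are assumptions; the original code is not shown.

```r
# Toy data; Zip is an integer, as in the variable table
new_loan <- data.frame(Zip = c(47711L, 47715L, 2139L, 2142L, 90210L))

# Restore leading zeros so every Zip Code has five digits
zip5 <- sprintf("%05d", new_loan$Zip)

# Cluster label = first three digits of the padded Zip Code
# (`ZipCluster` is a hypothetical column name)
new_loan$ZipCluster <- substr(zip5, 1, 3)

print(new_loan)
```

Without the padding, a Zip Code such as 2139 would be assigned the prefix "213" rather than "021" and land in the wrong cluster.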
We will perform 4 types of sampling on the loan data: Simple Random Sample, Systematic Sample, Stratified Sample, and Cluster Sample using Zip Code.
In simple random sampling, every member of the population has an equal chance of being selected. We set a seed to make our results reproducible and used the sample function to randomly select 1300 observations from the new_loan data set. Because every observation has the same probability of selection, this method avoids selection bias and yields a sample that is representative of the population on average.
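As a rough sketch, the simple random sample can be drawn as follows. The seed value and the toy `new_loan` data frame are assumptions; the text gives only the sample size of 1300 and the use of `sample`.

```r
# Toy population standing in for the cleaned loan data
new_loan <- data.frame(id = 1:10000)

set.seed(123)  # assumed seed; the text does not state the value used
n <- 1300      # sample size from the text

# Draw n row indices without replacement, each row equally likely
srs_idx <- sample(nrow(new_loan), size = n, replace = FALSE)
srs_sample <- new_loan[srs_idx, , drop = FALSE]

print(nrow(srs_sample))
```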
Systematic sampling involves selecting members from a larger population according to a random starting point and a fixed, periodic interval. This technique ensures that the population is evenly sampled and is often simpler and more straightforward than simple random sampling. It’s particularly useful when a complete list of all members of the population is available. In this case, we calculate the interval k by dividing the population size by the desired sample size, choose a random start within the first interval, and then select every k-th observation thereafter.
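The interval-and-start procedure can be sketched as below. The toy population, the seed, and the choice to round the interval down with `floor` are assumptions; the sample size of 1300 is carried over from the simple random sample.

```r
# Toy population standing in for the cleaned loan data
new_loan <- data.frame(id = 1:10000)

set.seed(123)        # assumed seed
N <- nrow(new_loan)  # population size
n <- 1300            # desired sample size

# Sampling interval: population size divided by sample size, rounded down
k <- floor(N / n)

# Random start within the first interval, then every k-th row
start <- sample(k, 1)
sys_idx <- seq(from = start, by = k, length.out = n)
sys_sample <- new_loan[sys_idx, , drop = FALSE]

print(c(interval = k, start = start, size = nrow(sys_sample)))
```

Rounding the interval down guarantees that the last selected index, at most k times n, never exceeds the population size.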
Stratified sampling is a method where the population is divided into homogeneous subgroups, known as strata, and random samples are taken from each stratum. The strata_sizes vector is used to ensure that the sample size for each stratum is proportional to the stratum’s size within the population. This method can provide greater precision than simple random sampling by ensuring that specific subgroups are adequately represented in the sample. This can be particularly important if we expect that the measurements could differ by subgroup.
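A sketch of proportional allocation follows. It assumes the strata are the geographic regions, which the text does not state explicitly, and uses a small toy population. Only the name `strata_sizes` comes from the text; the other names are hypothetical.

```r
set.seed(123)  # assumed seed

# Toy population with three strata (stratifying on Region is an assumption)
new_loan <- data.frame(
  id = 1:1000,
  Region = rep(c("South", "Midwest", "West Coast"), times = c(500, 300, 200)),
  stringsAsFactors = FALSE
)
n <- 100  # total sample size for this toy example

# Row indices grouped by stratum, and the size of each stratum
rows_by_stratum <- split(seq_len(nrow(new_loan)), new_loan$Region)
stratum_counts <- lengths(rows_by_stratum)

# Proportional allocation: each stratum's share of the sample
# matches its share of the population
strata_sizes <- round(n * stratum_counts / nrow(new_loan))

# Simple random sample within each stratum
strat_idx <- unlist(lapply(names(rows_by_stratum), function(s) {
  rows <- rows_by_stratum[[s]]
  rows[sample.int(length(rows), size = strata_sizes[[s]])]
}))
strat_sample <- new_loan[strat_idx, ]

print(table(strat_sample$Region))
```

With real data the rounded allocations may not add up to exactly n, so the total should be checked and adjusted by an observation or two if needed.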
Cluster sampling is a technique where the population is divided into separate groups, or clusters. A random sample of these clusters is then selected for analysis. In our case, we form the clusters from the Zip Code variable and then randomly select a number of these clusters. This method is advantageous when it is costly or impractical to conduct a census of the entire population. It is particularly useful when the population is spread out geographically and individual elements are not conveniently accessible.
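One-stage cluster sampling can be sketched as below. The column name `ZipCluster`, the number of clusters drawn, the seed, and the choice to keep every observation in a selected cluster are all assumptions, since the text does not give these details.

```r
set.seed(123)  # assumed seed

# Toy data: four Zip Code clusters with three loans each
new_loan <- data.frame(
  id = 1:12,
  ZipCluster = rep(c("021", "477", "606", "902"), each = 3),
  stringsAsFactors = FALSE
)

# List the distinct clusters and randomly pick some of them
clusters <- unique(new_loan$ZipCluster)
n_clusters <- 2  # assumed number of clusters to draw
chosen <- sample(clusters, size = n_clusters)

# Keep every observation that belongs to a chosen cluster
cluster_sample <- new_loan[new_loan$ZipCluster %in% chosen, ]

print(cluster_sample)
```

Because whole clusters are kept, the final sample size depends on which clusters are drawn and is not fixed in advance.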
In this study, we used the SBA loan dataset to compare different sampling techniques and their applicability to real-world data analysis. We began by preprocessing the dataset, which involved handling missing values and reclassifying the state variable into broader geographic regions. We then applied four sampling strategies (simple random, systematic, stratified, and cluster sampling), each chosen for its particular benefits and its fit with the characteristics of our data. Simple random sampling gave every observation the same chance of selection, while systematic sampling spread the selection evenly across the data. Stratified sampling improved precision by respecting subgroup proportions, and cluster sampling with Zip Codes took advantage of natural geographic divisions. Together, these methods aim to yield samples that reflect the larger population of small business loan applications.