1 Introduction

This data set is collected from banks on small business loan applications from the Small Business Association (SBA). It contains 899164 observations, each with 27 variables.The response variable is the MIS_Status value, which will display whether the loan was paid in full or defaulted on.

1.1 Variable Inspection

Next, we’re going to take a look at the variables we have, what they represent, and what type they are.

Variable Name	Variable Type	Description
LoanNr_ChkDgt	numeric	Identifier (ID) Variable
Name	character	Borrower Name
City	character	Borrower City
State	character	Borrower State
Zip	integer	Borrower Zip Code
Bank	character	Bank Name
BankState	character	Bank State
NAICS	integer	North American Industry Classification System code
ApprovalDate	character	Date SBA Commitment Issued
ApprovalFY	character	Fiscal Year of Commitment
Term	integer	Loan term in months
NoEmp	integer	Number of Employees
NewExist	integer	Business Existing (= 1) or New (= 2)
CreateJob	integer	Number of jobs created
RetainedJob	integer	Number of jobs retained
FranchiseCode	integer	Franchise Code; 0 = Franchise, 1 = No Franchise
UrbanRural	integer	Urban = 1, Rural = 2, Undefined = 0
RevLineCr	character	Revolving Credit; Y = Yes, N = No
LowDoc	character	LowDoc Loan; Y = Yes, N = No
ChgOffDate	character	Date loan is declared defaulted
DisbursementDate	character	Disbursement Date
DisbursementGross	character	Disbursement Amount
BalanceGross	character	Gross amount oustanding
MIS_Status	character	Loan Status; Paid off or Default
ChgOffPrinGr	character	Charged off Amount
GrAppv	character	Gross Approved Amount
SBA_Appv	character	SBA’s Guaranteed Amount

We have a wide variety of explanatory variables for each of our observations. Socioeconomic variables about the borrower, geographic variables that explain region, personal identifiers, loan terms, and important information about the small business.

2 Data Preprocessing

Just from a cursory look of our table, we must perform some preprocessing on the data as well as reformatting certain variables.

2.1 Remove Missing Values

The first step is to remove variables where the MIS_Status value is nonexistent. This is important because MIS_Status is our response variable, so we can’t perform an analysis on those observations that don’t have an outcome or response.

loan$MIS_Status[loan$MIS_Status == ""] <- NA
sumdata <- sum(is.na(loan$MIS_Status))

kable(sumdata, format = "markdown", col.names = c("Total Number of Missing MIS_Status Values"))

Total Number of Missing MIS_Status Values
1997

Blank values for MIS_Status variables must be removed. We first turned these values into NA values and then tabulated the total number of missing MIS_Status values - which comes out to 1997 observations without a MIS_Status value. We proceed to remove the entire observation and double check that all missing values have been removed

Final Number of Missing MIS_Status Values
0

2.2 Reformatting Currency

There are 5 variables - DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, SBA_APPv - that are displayed in currency form (i.e. including commas and dollar signs). As a result, these variables are considered character variables. We are only interested in the numerical values.

First_5.DisbursementGross	First_5.BalanceGross	First_5.ChargedOffPrincipal	First_5.GrossApprovedAmount	First_5.SBA_ApprovedAmount
$60,000.00	$0.00	$0.00	$60,000.00	$48,000.00
$40,000.00	$0.00	$0.00	$40,000.00	$32,000.00
$287,000.00	$0.00	$0.00	$287,000.00	$215,250.00
$35,000.00	$0.00	$0.00	$35,000.00	$28,000.00
$229,000.00	$0.00	$0.00	$229,000.00	$229,000.00

After reformatting these character values into numerical values by removing any commas, dollar signs, and other potential currency symbols, we have our desired form.

Changed_5.DisbursementGross	Changed_5.GrossApprovedAmount	Changed_5.SBA_ApprovedAmount
60000	60000	48000
40000	40000	32000
287000	287000	215250
35000	35000	28000
229000	229000	229000

2.3 Discretizing States by Geographic Regions

Originally we have 51 unique values for every state and the District of Columbia as their 2 letter abbreviations. We omitted any observations that did not have a 2-letter abbreviation. We then used the following map to classify each state by its geographic region.

US Geographic Regions

We convert the respective states to their corresponding regions - DC was converted to Mid-Atlantic.

Region	Count
Great Plains	37439
Mid-Atlantic	133249
Midwest	174646
New England	69547
Non-Contiguous	6009
Rocky Mountain	60427
South	148868
Southwest	102059
West Coast	164771

By analyzing the counts, we get a better idea of how our loan data is distributed by US geographic regions. One thing that is potentially problematic is having only 6009 observations for one of our Non-Contiguous category - comprising of Alaska and Hawaii. While this category makes up around less than 1% of the total data, it has practical significance in viewing how Alaska and Hawaii compare to the 48 contiguous states. As a result, we will leave it in the data despite potential concerns of sparsity.

3 Analysis

Next, we’ll look at how discretizing US State by Region and SBA Approved loan amounts impacts their predictive power.

3.1 Default Rate by Region

We’re calculating the default rate based on the MIS_Status showing “Account Charged Off.” When split by region, the default rates vary from 11.8% to 21.8%. From a higher-level overview there doesn’t seem to be any patterns of what general regions fair better or worse in terms of defaulting on loans, except one observation: The regions with the 3 lowest default rates - Great Plains, New England, and Non-Contiguous - have among the 3 lowest overall approved loans. This seems to demonstrate that lenders in these areas are more conservative with their loans as they most likely have a stricter acceptance policy which leads to less overall loans issued and greater chance that loans will be paid back.

Region	Count	Number Defaulted	Default Rate
Great Plains	37439	4517	12.06496
Mid-Atlantic	133249	24761	18.58250
Midwest	174646	29090	16.65655
New England	69547	8252	11.86536
Non-Contiguous	6009	830	13.81261
Rocky Mountain	60427	9115	15.08432
South	148868	32417	21.77567
Southwest	102059	19443	19.05075
West Coast	164771	29116	17.67059

3.2 Discretizing Gross Approval

While in general it’s good practice to avoid discretizing continuous variables as information is lost by binning them, there are 2 reasons why it’s more acceptable here:

If you look at the summary of the Gross Approved amounts, you’ll notice that there’s a large right skew with 75% of loans amounting to 225k USD or less but a max value of 5.47 million USD.

FALSE    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
FALSE    1000   35000   90000  193050  225000 5472000

Discretizing the variable in this case helps with this sparsity at the far right extremes.

The cutoffs for each bin correspond with the type of small business loan typically seen:

1k-50k USD: Microloans - Short-term loans that are generally seeked by newly established or growing small businesses
50k-200k USD: SBA Express loans - Loans in this range are typically the amounts approved by regional banks and are generally used for purchasing a large asset (company or capital expenses) to be paid off over a longer period - can be upwards of 25 years.
200k-500k USD: Medium-term loan: Meant to be paid back in 2-5 years. Generally used for an already established small business looking to expand further where the funds would help expand capital to buy larger inventory, supplies, open a new locations, etc.
500k-1 million USD: SBA 7(a) loan: This is the most common long-term loan and is generally only approved for large fixed assets or real estate. These are usually backed by collateral and loan guarantees and are only approved for well-established small businesses.
1 million+ USD: Large SBA loans: This category is for the businesses that are the most credit-worthy and those that wish to expand beyond the definitions of what a ‘small business’ means.

Plotting the density curves, we get a better understanding of the distribution and spread of our different discretized bins. The overall density curve shows that the majority of loans are in the lower range (less than 50k) demonstrating that smaller loans are far more common than larger loans. Both plots exhibit an overall right-skewed distribution, but the skew significantly decreases when you compare within each of the sub-populations.

4 Summary

We performed some exploratory data analysis and data preprocessing in order to analyze the data behind our small business loan data. Then we performed 2 discretization processes: one turning our categorical variable State into larger categories of US Geographic Regions and one on our continuous variable Gross Approved loan amount into pre-defined bins. We then analyzed its impact and distribution following the discretization.

Exploratory Data Analysis and Data Preprocessing of Small Bank Loan Data

Joshua Zhong

03/08/2024