1 Introduction

This data set is collected from banks on small business loan applications from the Small Business Association (SBA). It contains 899164 observations, each with 27 variables.The response variable is the MIS_Status value, which will display whether the loan was paid in full or defaulted on.

1.1 Variable Inspection

Next, we’re going to take a look at the variables we have, what they represent, and what type they are.

Variable Name Variable Type Description
LoanNr_ChkDgt numeric Identifier (ID) Variable
Name character Borrower Name
City character Borrower City
State character Borrower State
Zip integer Borrower Zip Code
Bank character Bank Name
BankState character Bank State
NAICS integer North American Industry Classification System code
ApprovalDate character Date SBA Commitment Issued
ApprovalFY character Fiscal Year of Commitment
Term integer Loan term in months
NoEmp integer Number of Employees
NewExist integer Business Existing (= 1) or New (= 2)
CreateJob integer Number of jobs created
RetainedJob integer Number of jobs retained
FranchiseCode integer Franchise Code; 0 = Franchise, 1 = No Franchise
UrbanRural integer Urban = 1, Rural = 2, Undefined = 0
RevLineCr character Revolving Credit; Y = Yes, N = No
LowDoc character LowDoc Loan; Y = Yes, N = No
ChgOffDate character Date loan is declared defaulted
DisbursementDate character Disbursement Date
DisbursementGross character Disbursement Amount
BalanceGross character Gross amount oustanding
MIS_Status character Loan Status; Paid off or Default
ChgOffPrinGr character Charged off Amount
GrAppv character Gross Approved Amount
SBA_Appv character SBA’s Guaranteed Amount

We have a wide variety of explanatory variables for each of our observations. Socioeconomic variables about the borrower, geographic variables that explain region, personal identifiers, loan terms, and important information about the small business.

2 Data Preprocessing

Just from a cursory look of our table, we must perform some preprocessing on the data as well as reformatting certain variables.

2.1 Remove Missing Values

The first step is to remove variables where the MIS_Status value is nonexistent. This is important because MIS_Status is our response variable, so we can’t perform an analysis on those observations that don’t have an outcome or response.

loan$MIS_Status[loan$MIS_Status == ""] <- NA
sumdata <- sum(is.na(loan$MIS_Status))

kable(sumdata, format = "markdown", col.names = c("Total Number of Missing MIS_Status Values"))
Total Number of Missing MIS_Status Values
1997

Blank values for MIS_Status variables must be removed. We first turned these values into NA values and then tabulated the total number of missing MIS_Status values - which comes out to 1997 observations without a MIS_Status value. We proceed to remove the entire observation and double check that all missing values have been removed

Final Number of Missing MIS_Status Values
0

2.2 Reformatting Currency

There are 5 variables - DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, SBA_APPv - that are displayed in currency form (i.e. including commas and dollar signs). As a result, these variables are considered character variables. We are only interested in the numerical values.

First_5.DisbursementGross First_5.BalanceGross First_5.ChargedOffPrincipal First_5.GrossApprovedAmount First_5.SBA_ApprovedAmount
$60,000.00 $0.00 $0.00 $60,000.00 $48,000.00
$40,000.00 $0.00 $0.00 $40,000.00 $32,000.00
$287,000.00 $0.00 $0.00 $287,000.00 $215,250.00
$35,000.00 $0.00 $0.00 $35,000.00 $28,000.00
$229,000.00 $0.00 $0.00 $229,000.00 $229,000.00

After reformatting these character values into numerical values by removing any commas, dollar signs, and other potential currency symbols, we have our desired form.

Changed_5.DisbursementGross Changed_5.BalanceGross Changed_5.ChargedOffPrincipal Changed_5.GrossApprovedAmount Changed_5.SBA_ApprovedAmount
60000 0 0 60000 48000
40000 0 0 40000 32000
287000 0 0 287000 215250
35000 0 0 35000 28000
229000 0 0 229000 229000

2.3 Discretizing States by Geographic Regions

Originally we have 51 unique values for every state and the District of Columbia as their 2 letter abbreviations. We omitted any observations that did not have a 2-letter abbreviation. We then used the following map to classify each state by its geographic region.

US Geographic Regions
US Geographic Regions

We convert the respective states to their corresponding regions - DC was converted to Mid-Atlantic.

Region Count
Great Plains 37439
Mid-Atlantic 133249
Midwest 174646
New England 69547
Non-Contiguous 6009
Rocky Mountain 60427
South 148868
Southwest 102059
West Coast 164771

By analyzing the counts, we get a better idea of how our loan data is distributed by US geographic regions. One thing that is potentially problematic is having only 6009 observations for one of our Non-Contiguous category - comprising of Alaska and Hawaii. While this category makes up around less than 1% of the total data, it has practical significance in viewing how Alaska and Hawaii compare to the 48 contiguous states. As a result, we will leave it in the data despite potential concerns of sparsity.

3 Analysis

Next, we’ll look at how discretizing US State by Region and SBA Approved loan amounts impacts their predictive power.

3.1 Default Rate by Region

We’re calculating the default rate based on the MIS_Status showing “Account Charged Off.” When split by region, the default rates vary from 11.8% to 21.8%. From a higher-level overview there doesn’t seem to be any patterns of what general regions fair better or worse in terms of defaulting on loans, except one observation: The regions with the 3 lowest default rates - Great Plains, New England, and Non-Contiguous - have among the 3 lowest overall approved loans. This seems to demonstrate that lenders in these areas are more conservative with their loans as they most likely have a stricter acceptance policy which leads to less overall loans issued and greater chance that loans will be paid back.

Region Count Number Defaulted Default Rate
Great Plains 37439 4517 12.06496
Mid-Atlantic 133249 24761 18.58250
Midwest 174646 29090 16.65655
New England 69547 8252 11.86536
Non-Contiguous 6009 830 13.81261
Rocky Mountain 60427 9115 15.08432
South 148868 32417 21.77567
Southwest 102059 19443 19.05075
West Coast 164771 29116 17.67059

3.2 Discretizing Gross Approval

While in general it’s good practice to avoid discretizing continuous variables as information is lost by binning them, there are 2 reasons why it’s more acceptable here:

  1. If you look at the summary of the Gross Approved amounts, you’ll notice that there’s a large right skew with 75% of loans amounting to 225k USD or less but a max value of 5.47 million USD.
FALSE    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
FALSE    1000   35000   90000  193050  225000 5472000

Discretizing the variable in this case helps with this sparsity at the far right extremes.

  1. The cutoffs for each bin correspond with the type of small business loan typically seen:
  • 1k-50k USD: Microloans - Short-term loans that are generally seeked by newly established or growing small businesses
  • 50k-200k USD: SBA Express loans - Loans in this range are typically the amounts approved by regional banks and are generally used for purchasing a large asset (company or capital expenses) to be paid off over a longer period - can be upwards of 25 years.
  • 200k-500k USD: Medium-term loan: Meant to be paid back in 2-5 years. Generally used for an already established small business looking to expand further where the funds would help expand capital to buy larger inventory, supplies, open a new locations, etc.
  • 500k-1 million USD: SBA 7(a) loan: This is the most common long-term loan and is generally only approved for large fixed assets or real estate. These are usually backed by collateral and loan guarantees and are only approved for well-established small businesses.
  • 1 million+ USD: Large SBA loans: This category is for the businesses that are the most credit-worthy and those that wish to expand beyond the definitions of what a ‘small business’ means.

Plotting the density curves, we get a better understanding of the distribution and spread of our different discretized bins. The overall density curve shows that the majority of loans are in the lower range (less than 50k) demonstrating that smaller loans are far more common than larger loans. Both plots exhibit an overall right-skewed distribution, but the skew significantly decreases when you compare within each of the sub-populations.

4 Summary

We performed some exploratory data analysis and data preprocessing in order to analyze the data behind our small business loan data. Then we performed 2 discretization processes: one turning our categorical variable State into larger categories of US Geographic Regions and one on our continuous variable Gross Approved loan amount into pre-defined bins. We then analyzed its impact and distribution following the discretization.