This data set is collected from banks on small business loan applications from the Small Business Association (SBA). It contains 899164 observations, each with 27 variables.The response variable is the MIS_Status value, which will display whether the loan was paid in full or defaulted on.
Next, we’re going to take a look at the variables we have, what they represent, and what type they are.
| Variable Name | Variable Type | Description |
|---|---|---|
| LoanNr_ChkDgt | numeric | Identifier (ID) Variable |
| Name | character | Borrower Name |
| City | character | Borrower City |
| State | character | Borrower State |
| Zip | integer | Borrower Zip Code |
| Bank | character | Bank Name |
| BankState | character | Bank State |
| NAICS | integer | North American Industry Classification System code |
| ApprovalDate | character | Date SBA Commitment Issued |
| ApprovalFY | character | Fiscal Year of Commitment |
| Term | integer | Loan term in months |
| NoEmp | integer | Number of Employees |
| NewExist | integer | Business Existing (= 1) or New (= 2) |
| CreateJob | integer | Number of jobs created |
| RetainedJob | integer | Number of jobs retained |
| FranchiseCode | integer | Franchise Code; 0 = Franchise, 1 = No Franchise |
| UrbanRural | integer | Urban = 1, Rural = 2, Undefined = 0 |
| RevLineCr | character | Revolving Credit; Y = Yes, N = No |
| LowDoc | character | LowDoc Loan; Y = Yes, N = No |
| ChgOffDate | character | Date loan is declared defaulted |
| DisbursementDate | character | Disbursement Date |
| DisbursementGross | character | Disbursement Amount |
| BalanceGross | character | Gross amount oustanding |
| MIS_Status | character | Loan Status; Paid off or Default |
| ChgOffPrinGr | character | Charged off Amount |
| GrAppv | character | Gross Approved Amount |
| SBA_Appv | character | SBA’s Guaranteed Amount |
We have a wide variety of explanatory variables for each of our observations. Socioeconomic variables about the borrower, geographic variables that explain region, personal identifiers, loan terms, and important information about the small business.
Just from a cursory look of our table, we must perform some preprocessing on the data as well as reformatting certain variables.
The first step is to remove variables where the MIS_Status value is nonexistent. This is important because MIS_Status is our response variable, so we can’t perform an analysis on those observations that don’t have an outcome or response.
loan$MIS_Status[loan$MIS_Status == ""] <- NA
sumdata <- sum(is.na(loan$MIS_Status))
kable(sumdata, format = "markdown", col.names = c("Total Number of Missing MIS_Status Values"))
| Total Number of Missing MIS_Status Values |
|---|
| 1997 |
Blank values for MIS_Status variables must be removed. We first turned these values into NA values and then tabulated the total number of missing MIS_Status values - which comes out to 1997 observations without a MIS_Status value. We proceed to remove the entire observation and double check that all missing values have been removed
| Final Number of Missing MIS_Status Values |
|---|
| 0 |
There are 5 variables - DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, SBA_APPv - that are displayed in currency form (i.e. including commas and dollar signs). As a result, these variables are considered character variables. We are only interested in the numerical values.
| First_5.DisbursementGross | First_5.BalanceGross | First_5.ChargedOffPrincipal | First_5.GrossApprovedAmount | First_5.SBA_ApprovedAmount |
|---|---|---|---|---|
| $60,000.00 | $0.00 | $0.00 | $60,000.00 | $48,000.00 |
| $40,000.00 | $0.00 | $0.00 | $40,000.00 | $32,000.00 |
| $287,000.00 | $0.00 | $0.00 | $287,000.00 | $215,250.00 |
| $35,000.00 | $0.00 | $0.00 | $35,000.00 | $28,000.00 |
| $229,000.00 | $0.00 | $0.00 | $229,000.00 | $229,000.00 |
After reformatting these character values into numerical values by removing any commas, dollar signs, and other potential currency symbols, we have our desired form.
| Changed_5.DisbursementGross | Changed_5.BalanceGross | Changed_5.ChargedOffPrincipal | Changed_5.GrossApprovedAmount | Changed_5.SBA_ApprovedAmount |
|---|---|---|---|---|
| 60000 | 0 | 0 | 60000 | 48000 |
| 40000 | 0 | 0 | 40000 | 32000 |
| 287000 | 0 | 0 | 287000 | 215250 |
| 35000 | 0 | 0 | 35000 | 28000 |
| 229000 | 0 | 0 | 229000 | 229000 |
Originally we have 51 unique values for every state and the District of Columbia as their 2 letter abbreviations. We omitted any observations that did not have a 2-letter abbreviation. We then used the following map to classify each state by its geographic region.
We convert the respective states to their corresponding regions - DC was converted to Mid-Atlantic.
| Region | Count |
|---|---|
| Great Plains | 37439 |
| Mid-Atlantic | 133249 |
| Midwest | 174646 |
| New England | 69547 |
| Non-Contiguous | 6009 |
| Rocky Mountain | 60427 |
| South | 148868 |
| Southwest | 102059 |
| West Coast | 164771 |
By analyzing the counts, we get a better idea of how our loan data is distributed by US geographic regions. One thing that is potentially problematic is having only 6009 observations for one of our Non-Contiguous category - comprising of Alaska and Hawaii. While this category makes up around less than 1% of the total data, it has practical significance in viewing how Alaska and Hawaii compare to the 48 contiguous states. As a result, we will leave it in the data despite potential concerns of sparsity.
Next, we’ll look at how discretizing US State by Region and SBA Approved loan amounts impacts their predictive power.
We’re calculating the default rate based on the MIS_Status showing “Account Charged Off.” When split by region, the default rates vary from 11.8% to 21.8%. From a higher-level overview there doesn’t seem to be any patterns of what general regions fair better or worse in terms of defaulting on loans, except one observation: The regions with the 3 lowest default rates - Great Plains, New England, and Non-Contiguous - have among the 3 lowest overall approved loans. This seems to demonstrate that lenders in these areas are more conservative with their loans as they most likely have a stricter acceptance policy which leads to less overall loans issued and greater chance that loans will be paid back.
| Region | Count | Number Defaulted | Default Rate |
|---|---|---|---|
| Great Plains | 37439 | 4517 | 12.06496 |
| Mid-Atlantic | 133249 | 24761 | 18.58250 |
| Midwest | 174646 | 29090 | 16.65655 |
| New England | 69547 | 8252 | 11.86536 |
| Non-Contiguous | 6009 | 830 | 13.81261 |
| Rocky Mountain | 60427 | 9115 | 15.08432 |
| South | 148868 | 32417 | 21.77567 |
| Southwest | 102059 | 19443 | 19.05075 |
| West Coast | 164771 | 29116 | 17.67059 |
While in general it’s good practice to avoid discretizing continuous variables as information is lost by binning them, there are 2 reasons why it’s more acceptable here:
FALSE Min. 1st Qu. Median Mean 3rd Qu. Max.
FALSE 1000 35000 90000 193050 225000 5472000
Discretizing the variable in this case helps with this sparsity at the far right extremes.
Plotting the density curves, we get a better understanding of the distribution and spread of our different discretized bins. The overall density curve shows that the majority of loans are in the lower range (less than 50k) demonstrating that smaller loans are far more common than larger loans. Both plots exhibit an overall right-skewed distribution, but the skew significantly decreases when you compare within each of the sub-populations.
We performed some exploratory data analysis and data preprocessing in order to analyze the data behind our small business loan data. Then we performed 2 discretization processes: one turning our categorical variable State into larger categories of US Geographic Regions and one on our continuous variable Gross Approved loan amount into pre-defined bins. We then analyzed its impact and distribution following the discretization.