Data Set Description

The name of the data set is National SBA (Small Business Administration). It is described as census type data. It was contained in 9 csv files eight of which held 100,000 observations each, with the ninth holding 99164 observations. the data sets were combined into one data set called loan, with 899,164 observations total. There are 28 variables in loan. the source of the data set is expressed to be from the United States Small Business Administration. A description of the data set, taken from its origin, states that “this data set is…provides historical data from 1987 through 2014…[for loans]…that was guaranteed to some degree by the SBA. Included is a variable [MIS_Status] which indicates if the loan was paid in full or defaulted/charged off. Below should be a printout of the structure of the data set.

'data.frame':   899164 obs. of  28 variables:
 $ X                : int  1 2 3 4 5 6 7 8 9 10 ...
 $ LoanNr_ChkDgt    : num  1e+09 1e+09 1e+09 1e+09 1e+09 ...
 $ Name             : chr  "ABC HOBBYCRAFT" "LANDMARK BAR & GRILLE (THE)" "WHITLOCK DDS, TODD M." "BIG BUCKS PAWN & JEWELRY, LLC" ...
 $ City             : chr  "EVANSVILLE" "NEW PARIS" "BLOOMINGTON" "BROKEN ARROW" ...
 $ State            : chr  "IN" "IN" "IN" "OK" ...
 $ Zip              : int  47711 46526 47401 74012 32801 6062 7083 34491 32456 6073 ...
 $ Bank             : chr  "FIFTH THIRD BANK" "1ST SOURCE BANK" "GRANT COUNTY STATE BANK" "1ST NATL BK & TR CO OF BROKEN" ...
 $ BankState        : chr  "OH" "IN" "IN" "OK" ...
 $ NAICS            : int  451120 722410 621210 0 0 332721 0 811118 721310 0 ...
 $ ApprovalDate     : chr  "28-Feb-97" "28-Feb-97" "28-Feb-97" "28-Feb-97" ...
 $ ApprovalFY       : chr  "1997" "1997" "1997" "1997" ...
 $ Term             : int  84 60 180 60 240 120 45 84 297 84 ...
 $ NoEmp            : int  4 2 7 2 14 19 45 1 2 3 ...
 $ NewExist         : int  2 2 1 1 1 1 2 2 2 2 ...
 $ CreateJob        : int  0 0 0 0 7 0 0 0 0 0 ...
 $ RetainedJob      : int  0 0 0 0 7 0 0 0 0 0 ...
 $ FranchiseCode    : int  1 1 1 1 1 1 0 1 1 1 ...
 $ UrbanRural       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ RevLineCr        : chr  "N" "N" "N" "N" ...
 $ LowDoc           : chr  "Y" "Y" "N" "Y" ...
 $ ChgOffDate       : chr  "" "" "" "" ...
 $ DisbursementDate : chr  "28-Feb-99" "31-May-97" "31-Dec-97" "30-Jun-97" ...
 $ DisbursementGross: chr  "$60,000.00 " "$40,000.00 " "$287,000.00 " "$35,000.00 " ...
 $ BalanceGross     : chr  "$0.00 " "$0.00 " "$0.00 " "$0.00 " ...
 $ MIS_Status       : chr  "P I F" "P I F" "P I F" "P I F" ...
 $ ChgOffPrinGr     : chr  "$0.00 " "$0.00 " "$0.00 " "$0.00 " ...
 $ GrAppv           : chr  "$60,000.00 " "$40,000.00 " "$287,000.00 " "$35,000.00 " ...
 $ SBA_Appv         : chr  "$48,000.00 " "$32,000.00 " "$215,250.00 " "$28,000.00 " ...

Exploratory Data Analysis

Delete all records whose MIS_Status value was missing

Bellow are two tables that contain the proportions and frequencies of the values in the MIS_Status variable in the data set. The MIS_Status has been expressed to indicate default status. In the first table, it may be seen that there are three types of values for the variable: CHGOFF, PIF, and an an empty character value. The empty character value is assumed to represent missing values. It represents .0022 of the values for MIS_Status. the second table demonstrates the same information as the first table, except the missing values has been remove. Notice that the frequencies of CHGOFF and PIF do not change however their proportions do.

MIS_Status Before
Value FREQ PROP
1997 0.0022210
CHGOFF 157558 0.1752272
P I F 739609 0.8225518
MIS_Status After
Value FREQ PROP
CHGOFF 157558 0.1756172
P I F 739609 0.8243828

Choose a categorical variable and combine its sparse categories.

Below is a proportion/frequency table for the North America Industry Classification System (NAICS) Codes. Above that table is table of some the values for the NAICS variable. It may be clear that these values vary from one another. However, the system codifies them by their first two digits. The proportion/frequency table holds proportions and frequencies of those codes after the values within the NAICS variable were combined with the NAICS codification.

NAICS PRE-Combination
Name NAICS
GOMEZ’S TAQUITOS EXPRESS 0
EUROPEAN TRADITIONS 442110
AUTOCHEM INC 422690
Andrews Chiropractic, P.C. 621310
PRIMARY PREP OF ORLANDO INC 624410
Andrzej Baksik 236118
Timbuktu’s Bar & Grill 722110
PRESTIGE COFFEE SERVICE 0
RKR Transpotation, Inc. 484110
TRUFFLES GRILLE & WINE BAR LLC 722110
NAICS PROP/FREQ
Value FREQ PROP
0 201667 0.2247820
11 8995 0.0100260
21 1851 0.0020632
22 662 0.0007379
23 66492 0.0741133
31-33 67903 0.0756860
42 48673 0.0542519
44-45 126975 0.1415288
48-49 22408 0.0249764
51 11362 0.0126643
52 9470 0.0105554
53 13588 0.0151455
54 67922 0.0757072
55 256 0.0002853
56 32529 0.0362575
61 6401 0.0071347
62 55264 0.0615983
71 14616 0.0162913
72 67511 0.0752491
81 72395 0.0806929
92 227 0.0002530
NAICS PRE-Combination
Name NAICS
GOMEZ’S TAQUITOS EXPRESS 0
EUROPEAN TRADITIONS 44-45
AUTOCHEM INC 42
Andrews Chiropractic, P.C. 62
PRIMARY PREP OF ORLANDO INC 62
Andrzej Baksik 23
Timbuktu’s Bar & Grill 72
PRESTIGE COFFEE SERVICE 0
RKR Transpotation, Inc. 48-49
TRUFFLES GRILLE & WINE BAR LLC 72

Calculate the default rates of the categorical variable.

Below is table of the default rates for the different industries that recieved loans. NAICS represents their codifications the status CHGOFF indicates default and The rate may be gleaned on the far right.

Default Rates by NAICS
NAICS Status Count Total Rate
0 CHGOFF 16799 201667 0.0833
11 CHGOFF 812 8995 0.0903
21 CHGOFF 157 1851 0.0848
22 CHGOFF 94 662 0.1420
23 CHGOFF 15463 66492 0.2326
31-33 CHGOFF 10438 67903 0.1537
42 CHGOFF 9480 48673 0.1948
44-45 CHGOFF 28868 126975 0.2274
48-49 CHGOFF 5939 22408 0.2650
51 CHGOFF 2821 11362 0.2483
52 CHGOFF 2692 9470 0.2843
53 CHGOFF 3904 13588 0.2873
54 CHGOFF 12957 67922 0.1908
55 CHGOFF 26 256 0.1016
56 CHGOFF 7661 32529 0.2355
61 CHGOFF 1552 6401 0.2425
62 CHGOFF 5736 55264 0.1038
71 CHGOFF 3013 14616 0.2061
72 CHGOFF 14882 67511 0.2204
81 CHGOFF 14229 72395 0.1965
92 CHGOFF 35 227 0.1542

Discretize GrAppv into 5 categories

Below is a proportion/frequency table for the variable discretization of GrAppv. Their are five categories that run from 0 to 4. Most of the values in GrAppv fell within the first category.

GrAppv PROP/FREQ
Value FREQ PROP
0 879282 0.9800650
1 17251 0.0192283
2 427 0.0004759
3 138 0.0001538
4 69 0.0000769

Draw the density curves of SBA_Appv for each of the 5 subpopulations

Below are density plots created using SBA_Appv for each of the SPGrAppv categories. On the Y-axis are the proportions for the X-axis dollar values of Guranteed Amount of Approved loans for each Group. The X-axis fruns from 0 to 6 million dollars