Data Set Description
The name of the data set is National SBA (Small Business
Administration). It is described as census type data. It was contained
in 9 csv files eight of which held 100,000 observations each, with the
ninth holding 99164 observations. the data sets were combined into one
data set called loan, with 899,164 observations total. There are 28
variables in loan. the source of the data set is expressed to be from
the United States Small Business Administration. A description of the
data set, taken from its origin, states that “this data set is…provides
historical data from 1987 through 2014…[for loans]…that was guaranteed
to some degree by the SBA. Included is a variable [MIS_Status] which
indicates if the loan was paid in full or defaulted/charged off. Below
should be a printout of the structure of the data set.
'data.frame': 899164 obs. of 28 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ LoanNr_ChkDgt : num 1e+09 1e+09 1e+09 1e+09 1e+09 ...
$ Name : chr "ABC HOBBYCRAFT" "LANDMARK BAR & GRILLE (THE)" "WHITLOCK DDS, TODD M." "BIG BUCKS PAWN & JEWELRY, LLC" ...
$ City : chr "EVANSVILLE" "NEW PARIS" "BLOOMINGTON" "BROKEN ARROW" ...
$ State : chr "IN" "IN" "IN" "OK" ...
$ Zip : int 47711 46526 47401 74012 32801 6062 7083 34491 32456 6073 ...
$ Bank : chr "FIFTH THIRD BANK" "1ST SOURCE BANK" "GRANT COUNTY STATE BANK" "1ST NATL BK & TR CO OF BROKEN" ...
$ BankState : chr "OH" "IN" "IN" "OK" ...
$ NAICS : int 451120 722410 621210 0 0 332721 0 811118 721310 0 ...
$ ApprovalDate : chr "28-Feb-97" "28-Feb-97" "28-Feb-97" "28-Feb-97" ...
$ ApprovalFY : chr "1997" "1997" "1997" "1997" ...
$ Term : int 84 60 180 60 240 120 45 84 297 84 ...
$ NoEmp : int 4 2 7 2 14 19 45 1 2 3 ...
$ NewExist : int 2 2 1 1 1 1 2 2 2 2 ...
$ CreateJob : int 0 0 0 0 7 0 0 0 0 0 ...
$ RetainedJob : int 0 0 0 0 7 0 0 0 0 0 ...
$ FranchiseCode : int 1 1 1 1 1 1 0 1 1 1 ...
$ UrbanRural : int 0 0 0 0 0 0 0 0 0 0 ...
$ RevLineCr : chr "N" "N" "N" "N" ...
$ LowDoc : chr "Y" "Y" "N" "Y" ...
$ ChgOffDate : chr "" "" "" "" ...
$ DisbursementDate : chr "28-Feb-99" "31-May-97" "31-Dec-97" "30-Jun-97" ...
$ DisbursementGross: chr "$60,000.00 " "$40,000.00 " "$287,000.00 " "$35,000.00 " ...
$ BalanceGross : chr "$0.00 " "$0.00 " "$0.00 " "$0.00 " ...
$ MIS_Status : chr "P I F" "P I F" "P I F" "P I F" ...
$ ChgOffPrinGr : chr "$0.00 " "$0.00 " "$0.00 " "$0.00 " ...
$ GrAppv : chr "$60,000.00 " "$40,000.00 " "$287,000.00 " "$35,000.00 " ...
$ SBA_Appv : chr "$48,000.00 " "$32,000.00 " "$215,250.00 " "$28,000.00 " ...
Exploratory Data Analysis
Delete all records whose MIS_Status value was missing
Bellow are two tables that contain the proportions and frequencies of
the values in the MIS_Status variable in the data set. The MIS_Status
has been expressed to indicate default status. In the first table, it
may be seen that there are three types of values for the variable:
CHGOFF, PIF, and an an empty character value. The empty character value
is assumed to represent missing values. It represents .0022 of the
values for MIS_Status. the second table demonstrates the same
information as the first table, except the missing values has been
remove. Notice that the frequencies of CHGOFF and PIF do not change
however their proportions do.
MIS_Status Before
|
1997 |
0.0022210 |
| CHGOFF |
157558 |
0.1752272 |
| P I F |
739609 |
0.8225518 |
MIS_Status After
| CHGOFF |
157558 |
0.1756172 |
| P I F |
739609 |
0.8243828 |
Choose a categorical variable and combine its sparse
categories.
Below is a proportion/frequency table for the North America Industry
Classification System (NAICS) Codes. Above that table is table of some
the values for the NAICS variable. It may be clear that these values
vary from one another. However, the system codifies them by their first
two digits. The proportion/frequency table holds proportions and
frequencies of those codes after the values within the NAICS variable
were combined with the NAICS codification.
NAICS PRE-Combination
| GOMEZ’S TAQUITOS EXPRESS |
0 |
| EUROPEAN TRADITIONS |
442110 |
| AUTOCHEM INC |
422690 |
| Andrews Chiropractic, P.C. |
621310 |
| PRIMARY PREP OF ORLANDO INC |
624410 |
| Andrzej Baksik |
236118 |
| Timbuktu’s Bar & Grill |
722110 |
| PRESTIGE COFFEE SERVICE |
0 |
| RKR Transpotation, Inc. |
484110 |
| TRUFFLES GRILLE & WINE BAR LLC |
722110 |
NAICS PROP/FREQ
| 0 |
201667 |
0.2247820 |
| 11 |
8995 |
0.0100260 |
| 21 |
1851 |
0.0020632 |
| 22 |
662 |
0.0007379 |
| 23 |
66492 |
0.0741133 |
| 31-33 |
67903 |
0.0756860 |
| 42 |
48673 |
0.0542519 |
| 44-45 |
126975 |
0.1415288 |
| 48-49 |
22408 |
0.0249764 |
| 51 |
11362 |
0.0126643 |
| 52 |
9470 |
0.0105554 |
| 53 |
13588 |
0.0151455 |
| 54 |
67922 |
0.0757072 |
| 55 |
256 |
0.0002853 |
| 56 |
32529 |
0.0362575 |
| 61 |
6401 |
0.0071347 |
| 62 |
55264 |
0.0615983 |
| 71 |
14616 |
0.0162913 |
| 72 |
67511 |
0.0752491 |
| 81 |
72395 |
0.0806929 |
| 92 |
227 |
0.0002530 |
NAICS PRE-Combination
| GOMEZ’S TAQUITOS EXPRESS |
0 |
| EUROPEAN TRADITIONS |
44-45 |
| AUTOCHEM INC |
42 |
| Andrews Chiropractic, P.C. |
62 |
| PRIMARY PREP OF ORLANDO INC |
62 |
| Andrzej Baksik |
23 |
| Timbuktu’s Bar & Grill |
72 |
| PRESTIGE COFFEE SERVICE |
0 |
| RKR Transpotation, Inc. |
48-49 |
| TRUFFLES GRILLE & WINE BAR LLC |
72 |
Calculate the default rates of the categorical
variable.
Below is table of the default rates for the different industries that
recieved loans. NAICS represents their codifications the status CHGOFF
indicates default and The rate may be gleaned on the far right.
Default Rates by NAICS
| 0 |
CHGOFF |
16799 |
201667 |
0.0833 |
| 11 |
CHGOFF |
812 |
8995 |
0.0903 |
| 21 |
CHGOFF |
157 |
1851 |
0.0848 |
| 22 |
CHGOFF |
94 |
662 |
0.1420 |
| 23 |
CHGOFF |
15463 |
66492 |
0.2326 |
| 31-33 |
CHGOFF |
10438 |
67903 |
0.1537 |
| 42 |
CHGOFF |
9480 |
48673 |
0.1948 |
| 44-45 |
CHGOFF |
28868 |
126975 |
0.2274 |
| 48-49 |
CHGOFF |
5939 |
22408 |
0.2650 |
| 51 |
CHGOFF |
2821 |
11362 |
0.2483 |
| 52 |
CHGOFF |
2692 |
9470 |
0.2843 |
| 53 |
CHGOFF |
3904 |
13588 |
0.2873 |
| 54 |
CHGOFF |
12957 |
67922 |
0.1908 |
| 55 |
CHGOFF |
26 |
256 |
0.1016 |
| 56 |
CHGOFF |
7661 |
32529 |
0.2355 |
| 61 |
CHGOFF |
1552 |
6401 |
0.2425 |
| 62 |
CHGOFF |
5736 |
55264 |
0.1038 |
| 71 |
CHGOFF |
3013 |
14616 |
0.2061 |
| 72 |
CHGOFF |
14882 |
67511 |
0.2204 |
| 81 |
CHGOFF |
14229 |
72395 |
0.1965 |
| 92 |
CHGOFF |
35 |
227 |
0.1542 |
Discretize GrAppv into 5 categories
Below is a proportion/frequency table for the variable discretization
of GrAppv. Their are five categories that run from 0 to 4. Most of the
values in GrAppv fell within the first category.
GrAppv PROP/FREQ
| 0 |
879282 |
0.9800650 |
| 1 |
17251 |
0.0192283 |
| 2 |
427 |
0.0004759 |
| 3 |
138 |
0.0001538 |
| 4 |
69 |
0.0000769 |
Draw the density curves of SBA_Appv for each of the 5
subpopulations
Below are density plots created using SBA_Appv for each of the
SPGrAppv categories. On the Y-axis are the proportions for the X-axis
dollar values of Guranteed Amount of Approved loans for each Group. The
X-axis fruns from 0 to 6 million dollars
