Data Set Description
The data set is named National SBA (Small Business Administration) and
is described as census-type data. It was stored in nine CSV files, eight
of which held 100,000 observations each, with the ninth holding 99,164.
The files were combined into one data set called loan, with 899,164
observations and 28 variables. The stated source of the data set is the
United States Small Business Administration. A description of the data
set, taken from its origin, states that “this data set is…provides
historical data from 1987 through 2014…[for loans]…that was guaranteed
to some degree by the SBA. Included is a variable [MIS_Status] which
indicates if the loan was paid in full or defaulted/charged off.” Below
is a printout of the structure of the data set.
'data.frame': 899164 obs. of 28 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ LoanNr_ChkDgt : num 1e+09 1e+09 1e+09 1e+09 1e+09 ...
$ Name : chr "ABC HOBBYCRAFT" "LANDMARK BAR & GRILLE (THE)" "WHITLOCK DDS, TODD M." "BIG BUCKS PAWN & JEWELRY, LLC" ...
$ City : chr "EVANSVILLE" "NEW PARIS" "BLOOMINGTON" "BROKEN ARROW" ...
$ State : chr "IN" "IN" "IN" "OK" ...
$ Zip : int 47711 46526 47401 74012 32801 6062 7083 34491 32456 6073 ...
$ Bank : chr "FIFTH THIRD BANK" "1ST SOURCE BANK" "GRANT COUNTY STATE BANK" "1ST NATL BK & TR CO OF BROKEN" ...
$ BankState : chr "OH" "IN" "IN" "OK" ...
$ NAICS : int 451120 722410 621210 0 0 332721 0 811118 721310 0 ...
$ ApprovalDate : chr "28-Feb-97" "28-Feb-97" "28-Feb-97" "28-Feb-97" ...
$ ApprovalFY : chr "1997" "1997" "1997" "1997" ...
$ Term : int 84 60 180 60 240 120 45 84 297 84 ...
$ NoEmp : int 4 2 7 2 14 19 45 1 2 3 ...
$ NewExist : int 2 2 1 1 1 1 2 2 2 2 ...
$ CreateJob : int 0 0 0 0 7 0 0 0 0 0 ...
$ RetainedJob : int 0 0 0 0 7 0 0 0 0 0 ...
$ FranchiseCode : int 1 1 1 1 1 1 0 1 1 1 ...
$ UrbanRural : int 0 0 0 0 0 0 0 0 0 0 ...
$ RevLineCr : chr "N" "N" "N" "N" ...
$ LowDoc : chr "Y" "Y" "N" "Y" ...
$ ChgOffDate : chr "" "" "" "" ...
$ DisbursementDate : chr "28-Feb-99" "31-May-97" "31-Dec-97" "30-Jun-97" ...
$ DisbursementGross: chr "$60,000.00 " "$40,000.00 " "$287,000.00 " "$35,000.00 " ...
$ BalanceGross : chr "$0.00 " "$0.00 " "$0.00 " "$0.00 " ...
$ MIS_Status : chr "P I F" "P I F" "P I F" "P I F" ...
$ ChgOffPrinGr : chr "$0.00 " "$0.00 " "$0.00 " "$0.00 " ...
$ GrAppv : chr "$60,000.00 " "$40,000.00 " "$287,000.00 " "$35,000.00 " ...
$ SBA_Appv : chr "$48,000.00 " "$32,000.00 " "$215,250.00 " "$28,000.00 " ...
Exploratory Data Analysis
Delete all records whose MIS_Status value was missing
Below are two tables that contain the frequencies and proportions of
the values of the MIS_Status variable, which indicates default status.
The first table shows three values for the variable: CHGOFF, P I F, and
an empty character value. The empty character value is assumed to
represent missing values; it accounts for 1,997 records, or about 0.0022
(0.22%) of the values of MIS_Status. The second table shows the same
information after the records with missing values have been removed,
leaving 897,167 observations. Notice that the frequencies of CHGOFF and
P I F do not change, but their proportions do, because the total they
are divided by is smaller.
MIS_Status Before
| MIS_Status | Frequency | Proportion |
|---|---|---|
| (empty) | 1997 | 0.0022210 |
| CHGOFF | 157558 | 0.1752272 |
| P I F | 739609 | 0.8225518 |
MIS_Status After
| MIS_Status | Frequency | Proportion |
|---|---|---|
| CHGOFF | 157558 | 0.1756172 |
| P I F | 739609 | 0.8243828 |
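The before/after comparison can be checked with a short sketch. The report's output appears to come from R; the plain Python below is only an illustration, not the code used for the report. It takes the counts from the "before" table, treats the empty string as missing (the assumption stated in the text), and recomputes the proportions with and without that category.

```python
# Counts of each MIS_Status value, copied from the "before" table.
counts_before = {"": 1997, "CHGOFF": 157558, "P I F": 739609}


def proportions(counts):
    """Return each category's share of the total count."""
    total = sum(counts.values())
    return {status: n / total for status, n in counts.items()}


# Assumption from the text: the empty string marks a missing value, so drop it.
counts_after = {status: n for status, n in counts_before.items() if status != ""}

props_before = proportions(counts_before)
props_after = proportions(counts_after)

# Frequencies stay the same; only the denominator (and so the proportion) changes.
for status, n in counts_after.items():
    print(f"{status!r}: n={n}, "
          f"before={props_before[status]:.7f}, after={props_after[status]:.7f}")
```

The remaining total, 897,167, is the row count reported for the original data set in the sampling sections.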
Choose a categorical variable and combine its sparse categories.
The categorical variable chosen is NAICS, the North American Industry
Classification System code. Three tables follow. The first shows the raw
NAICS values for ten businesses; these are six-digit codes or 0, and
they vary widely from one business to the next. The system groups
industries into sectors by the first two digits of the code, and a few
sectors span a range of two-digit prefixes (31-33, 44-45, and 48-49).
The second table holds the frequencies and proportions of the sector
codes after each value of the NAICS variable was replaced by its sector.
The third table shows the same ten businesses after the combination.
NAICS PRE-Combination
| Name | NAICS |
|---|---|
| GOMEZ’S TAQUITOS EXPRESS | 0 |
| EUROPEAN TRADITIONS | 442110 |
| AUTOCHEM INC | 422690 |
| Andrews Chiropractic, P.C. | 621310 |
| PRIMARY PREP OF ORLANDO INC | 624410 |
| Andrzej Baksik | 236118 |
| Timbuktu’s Bar & Grill | 722110 |
| PRESTIGE COFFEE SERVICE | 0 |
| RKR Transpotation, Inc. | 484110 |
| TRUFFLES GRILLE & WINE BAR LLC | 722110 |
NAICS PROP/FREQ
| NAICS | Frequency | Proportion |
|---|---|---|
| 0 | 201667 | 0.2247820 |
| 11 | 8995 | 0.0100260 |
| 21 | 1851 | 0.0020632 |
| 22 | 662 | 0.0007379 |
| 23 | 66492 | 0.0741133 |
| 31-33 | 67903 | 0.0756860 |
| 42 | 48673 | 0.0542519 |
| 44-45 | 126975 | 0.1415288 |
| 48-49 | 22408 | 0.0249764 |
| 51 | 11362 | 0.0126643 |
| 52 | 9470 | 0.0105554 |
| 53 | 13588 | 0.0151455 |
| 54 | 67922 | 0.0757072 |
| 55 | 256 | 0.0002853 |
| 56 | 32529 | 0.0362575 |
| 61 | 6401 | 0.0071347 |
| 62 | 55264 | 0.0615983 |
| 71 | 14616 | 0.0162913 |
| 72 | 67511 | 0.0752491 |
| 81 | 72395 | 0.0806929 |
| 92 | 227 | 0.0002530 |
NAICS POST-Combination
| Name | NAICS |
|---|---|
| GOMEZ’S TAQUITOS EXPRESS | 0 |
| EUROPEAN TRADITIONS | 44-45 |
| AUTOCHEM INC | 42 |
| Andrews Chiropractic, P.C. | 62 |
| PRIMARY PREP OF ORLANDO INC | 62 |
| Andrzej Baksik | 23 |
| Timbuktu’s Bar & Grill | 72 |
| PRESTIGE COFFEE SERVICE | 0 |
| RKR Transpotation, Inc. | 48-49 |
| TRUFFLES GRILLE & WINE BAR LLC | 72 |
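The recoding from raw code to sector can be sketched as follows. This is a minimal Python illustration rather than the code used for the report (whose output appears to come from R). The function name `naics_sector` and the lookup table are hypothetical, and keeping 0 as its own group is an assumption read off the tables above.

```python
# Two-digit prefixes that NAICS folds into one multi-prefix sector.
RANGED_SECTORS = {
    "31": "31-33", "32": "31-33", "33": "31-33",  # manufacturing
    "44": "44-45", "45": "44-45",                  # retail trade
    "48": "48-49", "49": "48-49",                  # transportation and warehousing
}


def naics_sector(code: int) -> str:
    """Collapse a NAICS code to its sector label; 0 is kept as its own group."""
    if code == 0:
        return "0"
    prefix = str(code)[:2]
    # Prefixes outside the ranged sectors are already the sector code.
    return RANGED_SECTORS.get(prefix, prefix)


# Raw codes taken from the pre-combination table.
examples = [0, 442110, 422690, 621310, 624410, 236118, 722110, 484110]
sectors = [naics_sector(code) for code in examples]
print(list(zip(examples, sectors)))
```

Collapsing to sectors reduces the variable to the 21 groups listed in the frequency table, the smallest of which (92) still holds 227 records.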
Random Sampling
Next, random sampling was conducted on the data set. Four procedures
were used: simple random sampling, systematic sampling, stratified
sampling, and cluster sampling.
Simple Random Sample
Here simple random sampling was conducted. A sample of 1,000
observations was drawn at random, without replacement, from the data
set. Four tables follow. The first two show the first ten observations
of the original data set and of the sample. Notice that the values of
the variable X in the sample are in ascending order but are not
sequential. The next two tables show the dimensions of the original data
set and of the sample. The sample contains only 1,000 observations but
keeps all 28 variables of the original data set.
N=10 obs: Original Data Set
| X | LoanNr_ChkDgt | Name |
|---|---|---|
| 1 | 1000014003 | ABC HOBBYCRAFT |
| 2 | 1000024006 | LANDMARK BAR & GRILLE (THE) |
| 3 | 1000034009 | WHITLOCK DDS, TODD M. |
| 4 | 1000044001 | BIG BUCKS PAWN & JEWELRY, LLC |
| 5 | 1000054004 | ANASTASIA CONFECTIONS, INC. |
| 6 | 1000084002 | B&T SCREW MACHINE COMPANY, INC |
| 7 | 1000093009 | MIDDLE ATLANTIC SPORTS CO INC |
| 8 | 1000094005 | WEAVER PRODUCTS |
| 9 | 1000104006 | TURTLE BEACH INN |
| 10 | 1000124001 | INTEXT BUILDING SYS LLC |
N=10 obs: Sample
| X | LoanNr_ChkDgt | Name |
|---|---|---|
| 500 | 1003855010 | JOHN R GRENIER INC |
| 2923 | 1019285010 | C.M.R.I. Export, Inc. |
| 4572 | 1031435010 | David L. Nash DBA David L. Nas |
| 6541 | 1044505006 | Secondary Marketing Strategies |
| 7160 | 1048794002 | MARK A. AND SHANA V. BOURCIER |
| 7207 | 1049065004 | Dancing Cranes Salon & Spa LLC |
| 9743 | 1067755003 | New Look Plus, Inc. |
| 11231 | 1080654006 | AUTECH, INC. |
| 11323 | 1081254006 | MONEY SAVERS, INC. |
| 11547 | 1082575006 | Nanny’s Attic, Inc. |
Dimensions: Original Data Set
| Rows | Columns |
|---|---|
| 897167 | 28 |
Dimensions: Sample
| Rows | Columns |
|---|---|
| 1000 | 28 |
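A simple random sample without replacement can be sketched in a few lines. This is an illustrative Python sketch, not the report's own code; the seed is hypothetical and is there only to make the draw repeatable. Row positions are zero-based here, whereas the X column in the tables is a one-based label.

```python
import random

N_ROWS = 897_167      # rows left after the missing MIS_Status records were removed
SAMPLE_SIZE = 1_000

rng = random.Random(2024)   # hypothetical seed, only for repeatability

# random.sample draws without replacement, so no row can be picked twice.
# Sorting mirrors the ascending (but non-sequential) X values in the sample table.
sampled_rows = sorted(rng.sample(range(N_ROWS), SAMPLE_SIZE))

print(len(sampled_rows), sampled_rows[:5])
```

Selecting those rows from the full table keeps every column, which is why the sample has the same 28 variables as the original.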
Systematic Random Sample
Here systematic random sampling was conducted. The intent was a sample
of about 1,000 observations, with the first observation chosen at random
and every later observation a constant distance from the one before it.
Four tables follow. The first two show the first ten observations of the
original data set and of the sample. Notice that the values of the
variable X in the sample are in ascending order but are not sequential.
The sampled observations start at X = 316261 and were taken every 898
rows from there on; the gaps in X are occasionally larger than 898,
which is consistent with X numbering the rows before the records with
missing MIS_Status were removed. The next two tables show the dimensions
of the original data set and of the sample. The sample contains only 649
observations rather than about 1,000, which is what a start roughly a
third of the way into the data set produces when selection runs only
from that start to the last row. All 28 variables of the original data
set are kept.
N=10 obs: Original Data Set
| X | LoanNr_ChkDgt | Name |
|---|---|---|
| 1 | 1000014003 | ABC HOBBYCRAFT |
| 2 | 1000024006 | LANDMARK BAR & GRILLE (THE) |
| 3 | 1000034009 | WHITLOCK DDS, TODD M. |
| 4 | 1000044001 | BIG BUCKS PAWN & JEWELRY, LLC |
| 5 | 1000054004 | ANASTASIA CONFECTIONS, INC. |
| 6 | 1000084002 | B&T SCREW MACHINE COMPANY, INC |
| 7 | 1000093009 | MIDDLE ATLANTIC SPORTS CO INC |
| 8 | 1000094005 | WEAVER PRODUCTS |
| 9 | 1000104006 | TURTLE BEACH INN |
| 10 | 1000124001 | INTEXT BUILDING SYS LLC |
N=10 obs: Sample
| X | LoanNr_ChkDgt | Name |
|---|---|---|
| 316261 | 3182695003 | Vasudev Inc. |
| 317159 | 3189304009 | BAYSIDE HILLS PASTRY SHOP |
| 318058 | 3195185002 | DB’S CAFE AND GRILL LLC |
| 318956 | 3201135006 | My Rescue Enterprises, Inc. |
| 319854 | 3207556000 | OMNI DESIGN GROUP INC |
| 320753 | 3213405001 | ACCESS SECURITY, INC. |
| 321654 | 3219895009 | Daniel B. Orellana DBA America |
| 322553 | 3227425008 | DON’S DUGOUT LLC |
| 323451 | 3234825008 | DATTAGE LANDSCAPING, INC |
| 324349 | 3241916005 | York Woods Tree Service LLC |
Dimensions: Original Data Set
| Rows | Columns |
|---|---|
| 897167 | 28 |
Dimensions: Sample
| Rows | Columns |
|---|---|
| 649 | 28 |
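The effect of the starting point on the sample size can be illustrated with a short Python sketch. It is not the report's code: the helper `systematic_rows`, the seed, and the late start of 315,000 are all hypothetical. Taking the interval as the ceiling of 897,167 / 1,000 is also an assumption, although it does give the 898 quoted in the text.

```python
import math
import random

N_ROWS = 897_167
TARGET_SIZE = 1_000
STEP = math.ceil(N_ROWS / TARGET_SIZE)   # sampling interval


def systematic_rows(start: int, step: int = STEP, n_rows: int = N_ROWS) -> range:
    """Zero-based row positions: start, start + step, ... up to the last row."""
    return range(start, n_rows, step)


rng = random.Random(7)   # hypothetical seed, only for repeatability

# Usual scheme: draw the start from the first interval, so the whole file is covered.
near_target = systematic_rows(rng.randrange(STEP))

# Hypothetical late start (not the report's actual value): rows before it are skipped.
late_start = systematic_rows(315_000)

print(STEP, len(near_target), len(late_start))
```

Drawing the start from the first interval keeps the sample within one observation of the target, while a start deep in the file shortens the sample in proportion to the rows skipped.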
Stratified Sample
Here stratified sampling was conducted, with the 21 combined NAICS
groups as strata. The intent was a sample of about 1,000 observations,
with about 48 observations drawn at random from each group. Four tables
follow. The first two show the first ten observations of the original
data set and of the sample. Notice that the values of the variable X in
the sample are in ascending order but are not sequential; also notice
the different NAICS codes. The next two tables show the dimensions of
the original data set and of the sample. The sample contains 1,008
observations (21 groups of 48) and keeps all 28 variables of the
original data set.
N=10 obs: Original Data Set
| X | LoanNr_ChkDgt | Name | NAICS |
|---|---|---|---|
| 1 | 1000014003 | ABC HOBBYCRAFT | 44-45 |
| 2 | 1000024006 | LANDMARK BAR & GRILLE (THE) | 72 |
| 3 | 1000034009 | WHITLOCK DDS, TODD M. | 62 |
| 4 | 1000044001 | BIG BUCKS PAWN & JEWELRY, LLC | 0 |
| 5 | 1000054004 | ANASTASIA CONFECTIONS, INC. | 0 |
| 6 | 1000084002 | B&T SCREW MACHINE COMPANY, INC | 31-33 |
| 7 | 1000093009 | MIDDLE ATLANTIC SPORTS CO INC | 0 |
| 8 | 1000094005 | WEAVER PRODUCTS | 81 |
| 9 | 1000104006 | TURTLE BEACH INN | 72 |
| 10 | 1000124001 | INTEXT BUILDING SYS LLC | 0 |
N=10 obs: Sample
| X | LoanNr_ChkDgt | Name | NAICS |
|---|---|---|---|
| 758 | 1005375001 | ACSI, Inc. | 52 |
| 765 | 1005435006 | Miami Home Shutters, Inc. | 92 |
| 1433 | 1009664006 | GOLD COAST BUILDERS | 55 |
| 2078 | 1014106010 | VINYL WORLD GULFPORT INC | 42 |
| 2659 | 1017626008 | GREAT SHAPE OF CALIFORNIA | 71 |
| 3333 | 1022055001 | Bernadette Antonino dba Sanisa | 92 |
| 5041 | 1034595007 | Sofonias, Inc. | 44-45 |
| 5406 | 1036865009 | GARY M. KEEGAN DBA VALLEY HOT | 42 |
| 5545 | 1037735007 | Peggy S. Farrell dba Peggy Far | 52 |
| 5800 | 1039565008 | D & T Drilling, Inc. | 21 |
Dimensions: Original Data Set
| Rows | Columns |
|---|---|
| 897167 | 28 |
Dimensions: Sample
| Rows | Columns |
|---|---|
| 1008 | 28 |
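The sample size of 1,008 follows from equal allocation across the NAICS groups, which the Python sketch below illustrates. It is not the report's code. The stratum sizes are copied from the NAICS frequency table, while the seed and the choice to round the per-group size up with a ceiling are assumptions.

```python
import math
import random

# Stratum sizes copied from the NAICS frequency table.
STRATUM_SIZES = {
    "0": 201667, "11": 8995, "21": 1851, "22": 662, "23": 66492,
    "31-33": 67903, "42": 48673, "44-45": 126975, "48-49": 22408,
    "51": 11362, "52": 9470, "53": 13588, "54": 67922, "55": 256,
    "56": 32529, "61": 6401, "62": 55264, "71": 14616, "72": 67511,
    "81": 72395, "92": 227,
}
TARGET_SIZE = 1_000

# Equal allocation: the same number of rows from every stratum, rounded up (assumption).
per_stratum = math.ceil(TARGET_SIZE / len(STRATUM_SIZES))

rng = random.Random(11)   # hypothetical seed, only for repeatability

# Within each stratum, draw row positions at random without replacement.
sample = {
    sector: sorted(rng.sample(range(size), per_stratum))
    for sector, size in STRATUM_SIZES.items()
}

total = sum(len(rows) for rows in sample.values())
print(len(STRATUM_SIZES), per_stratum, total)
```

Equal allocation gives small sectors such as 55 and 92 the same weight as the largest group, which is why those codes appear often in the first ten sampled rows.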