Data Set Description

The data set is named National SBA (Small Business Administration) and is described as census-type data. It was supplied as 9 CSV files: eight hold 100,000 observations each and the ninth holds 99,164. The files were combined into one data set called loan, with 899,164 observations and 28 variables. The stated source of the data is the United States Small Business Administration. A description of the data set, taken from its origin, states that “this data set is…provides historical data from 1987 through 2014…[for loans]…that was guaranteed to some degree by the SBA. Included is a variable [MIS_Status] which indicates if the loan was paid in full or defaulted/charged off.” Below is a printout of the structure of the data set.

'data.frame':   899164 obs. of  28 variables:
 $ X                : int  1 2 3 4 5 6 7 8 9 10 ...
 $ LoanNr_ChkDgt    : num  1e+09 1e+09 1e+09 1e+09 1e+09 ...
 $ Name             : chr  "ABC HOBBYCRAFT" "LANDMARK BAR & GRILLE (THE)" "WHITLOCK DDS, TODD M." "BIG BUCKS PAWN & JEWELRY, LLC" ...
 $ City             : chr  "EVANSVILLE" "NEW PARIS" "BLOOMINGTON" "BROKEN ARROW" ...
 $ State            : chr  "IN" "IN" "IN" "OK" ...
 $ Zip              : int  47711 46526 47401 74012 32801 6062 7083 34491 32456 6073 ...
 $ Bank             : chr  "FIFTH THIRD BANK" "1ST SOURCE BANK" "GRANT COUNTY STATE BANK" "1ST NATL BK & TR CO OF BROKEN" ...
 $ BankState        : chr  "OH" "IN" "IN" "OK" ...
 $ NAICS            : int  451120 722410 621210 0 0 332721 0 811118 721310 0 ...
 $ ApprovalDate     : chr  "28-Feb-97" "28-Feb-97" "28-Feb-97" "28-Feb-97" ...
 $ ApprovalFY       : chr  "1997" "1997" "1997" "1997" ...
 $ Term             : int  84 60 180 60 240 120 45 84 297 84 ...
 $ NoEmp            : int  4 2 7 2 14 19 45 1 2 3 ...
 $ NewExist         : int  2 2 1 1 1 1 2 2 2 2 ...
 $ CreateJob        : int  0 0 0 0 7 0 0 0 0 0 ...
 $ RetainedJob      : int  0 0 0 0 7 0 0 0 0 0 ...
 $ FranchiseCode    : int  1 1 1 1 1 1 0 1 1 1 ...
 $ UrbanRural       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ RevLineCr        : chr  "N" "N" "N" "N" ...
 $ LowDoc           : chr  "Y" "Y" "N" "Y" ...
 $ ChgOffDate       : chr  "" "" "" "" ...
 $ DisbursementDate : chr  "28-Feb-99" "31-May-97" "31-Dec-97" "30-Jun-97" ...
 $ DisbursementGross: chr  "$60,000.00 " "$40,000.00 " "$287,000.00 " "$35,000.00 " ...
 $ BalanceGross     : chr  "$0.00 " "$0.00 " "$0.00 " "$0.00 " ...
 $ MIS_Status       : chr  "P I F" "P I F" "P I F" "P I F" ...
 $ ChgOffPrinGr     : chr  "$0.00 " "$0.00 " "$0.00 " "$0.00 " ...
 $ GrAppv           : chr  "$60,000.00 " "$40,000.00 " "$287,000.00 " "$35,000.00 " ...
 $ SBA_Appv         : chr  "$48,000.00 " "$32,000.00 " "$215,250.00 " "$28,000.00 " ...
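
The combining step described above might look like the following in R. This is a minimal sketch, not the code used for this report: the `combine_csv` helper, the file names, and the two small temporary files are assumptions made so the example is self-contained.

```r
# Hypothetical helper: read every CSV in a vector of paths and stack the
# results into one data frame (all files are assumed to share the same columns).
combine_csv <- function(paths) {
  parts <- lapply(paths, read.csv, stringsAsFactors = FALSE)
  do.call(rbind, parts)
}

# Toy stand-ins for the nine SBA files: two small temporary CSVs.
tmp_dir <- tempfile("sba_")
dir.create(tmp_dir)
write.csv(data.frame(LoanNr_ChkDgt = 1:3,
                     MIS_Status = c("P I F", "CHGOFF", "")),
          file.path(tmp_dir, "part1.csv"), row.names = FALSE)
write.csv(data.frame(LoanNr_ChkDgt = 4:5,
                     MIS_Status = c("P I F", "P I F")),
          file.path(tmp_dir, "part2.csv"), row.names = FALSE)

paths <- sort(list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE))
loan  <- combine_csv(paths)

str(loan)    # structure printout, in the same style as the one above
nrow(loan)   # 3 + 2 rows in this toy example

# Row-count check for the real data: eight files of 100,000 plus one of 99,164.
expected_rows <- 8 * 100000 + 99164
expected_rows
```

With the real files, `paths` would point at the nine CSVs and `nrow(loan)` should equal `expected_rows`.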

Exploratory Data Analysis

Delete all records whose MIS_Status value was missing

Below are two tables that contain the frequencies and proportions of the values of the MIS_Status variable, which indicates default status. The first table shows three values for the variable: CHGOFF, P I F, and an empty character value. The empty character value is assumed to represent a missing value; it accounts for 1,997 records, a proportion of 0.0022. The second table shows the same information after the records with missing values were removed. Notice that the frequencies of CHGOFF and P I F do not change, but their proportions do, because the total drops from 899,164 to 897,167.

MIS_Status Before
Value    FREQ     PROP
""       1997     0.0022210
CHGOFF   157558   0.1752272
P I F    739609   0.8225518

MIS_Status After
Value    FREQ     PROP
CHGOFF   157558   0.1756172
P I F    739609   0.8243828
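
The two tables above could be produced along the following lines. This is a sketch under stated assumptions rather than the report's own code: the `freq_prop` helper is hypothetical, and an eight-row toy data frame stands in for loan.

```r
# Toy stand-in for loan; "" plays the role of a missing MIS_Status value.
loan <- data.frame(
  MIS_Status = c("P I F", "CHGOFF", "", "P I F", "P I F", "CHGOFF", "P I F", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: frequency and proportion of each distinct value.
freq_prop <- function(x) {
  freq <- table(x, useNA = "ifany")
  data.frame(Value = names(freq),
             FREQ  = as.vector(freq),
             PROP  = as.vector(prop.table(freq)),
             stringsAsFactors = FALSE)
}

before <- freq_prop(loan$MIS_Status)

# Keep only the records whose MIS_Status is neither NA nor empty.
keep  <- !is.na(loan$MIS_Status) & loan$MIS_Status != ""
loan  <- loan[keep, , drop = FALSE]
after <- freq_prop(loan$MIS_Status)

print(before)   # three values, including ""
print(after)    # same counts for CHGOFF and P I F, larger proportions
```

The counts for the remaining values are unchanged because only the rows with an empty value are dropped; the proportions rise because the denominator shrinks.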

Choose a categorical variable and combine its sparse categories.

Below is a frequency/proportion table for the North American Industry Classification System (NAICS) codes. The table above it lists the NAICS values of a few records before any combining. These six-digit values differ widely from one another, but the system groups them into sectors by their first two digits. The frequency/proportion table holds the counts and proportions of those sectors after the values of the NAICS variable were combined by sector. The last table lists the same records after combining.

NAICS PRE-Combination
Name                            NAICS
GOMEZ’S TAQUITOS EXPRESS        0
EUROPEAN TRADITIONS             442110
AUTOCHEM INC                    422690
Andrews Chiropractic, P.C.      621310
PRIMARY PREP OF ORLANDO INC     624410
Andrzej Baksik                  236118
Timbuktu’s Bar & Grill          722110
PRESTIGE COFFEE SERVICE         0
RKR Transpotation, Inc.         484110
TRUFFLES GRILLE & WINE BAR LLC  722110

NAICS PROP/FREQ
Value   FREQ     PROP
0       201667   0.2247820
11      8995     0.0100260
21      1851     0.0020632
22      662      0.0007379
23      66492    0.0741133
31-33   67903    0.0756860
42      48673    0.0542519
44-45   126975   0.1415288
48-49   22408    0.0249764
51      11362    0.0126643
52      9470     0.0105554
53      13588    0.0151455
54      67922    0.0757072
55      256      0.0002853
56      32529    0.0362575
61      6401     0.0071347
62      55264    0.0615983
71      14616    0.0162913
72      67511    0.0752491
81      72395    0.0806929
92      227      0.0002530

NAICS POST-Combination
Name                            NAICS
GOMEZ’S TAQUITOS EXPRESS        0
EUROPEAN TRADITIONS             44-45
AUTOCHEM INC                    42
Andrews Chiropractic, P.C.      62
PRIMARY PREP OF ORLANDO INC     62
Andrzej Baksik                  23
Timbuktu’s Bar & Grill          72
PRESTIGE COFFEE SERVICE         0
RKR Transpotation, Inc.         48-49
TRUFFLES GRILLE & WINE BAR LLC  72
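
One way to carry out the combination shown above is sketched below. The `naics_sector` helper is hypothetical and the report's actual code may differ; the grouping of 31 to 33, 44 to 45, and 48 to 49 follows the labels in the frequency table, and a value of 0 is left as 0.

```r
# Hypothetical helper: collapse six-digit NAICS codes to two-digit sectors,
# merging the sector ranges that appear in the frequency table above.
naics_sector <- function(naics) {
  # as.integer() first, so large values are never printed in scientific notation
  two <- substr(as.character(as.integer(naics)), 1, 2)   # "0" stays "0"
  two[two %in% c("31", "32", "33")] <- "31-33"
  two[two %in% c("44", "45")]       <- "44-45"
  two[two %in% c("48", "49")]       <- "48-49"
  two
}

# The ten NAICS values listed in the pre-combination table.
naics  <- c(0L, 442110L, 422690L, 621310L, 624410L,
            236118L, 722110L, 0L, 484110L, 722110L)
sector <- naics_sector(naics)
sector

# Frequency and proportion of each combined code.
freq <- table(sector)
freq
prop.table(freq)
```

Applied to the ten values from the first table, the helper gives the codes listed in the last table.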

Random Sampling

Next, random sampling was conducted on the data set. Four procedures were used: simple random sampling, systematic sampling, stratified sampling, and cluster sampling.

Simple Random Sample

Here simple random sampling was conducted. A sample of 1000 observations was taken from the data set at random, without replacement. Some tables follow. The first two show the first ten observations in the original data set and in the sample. Notice that the values of the variable X in the sample are in ascending order but are not sequential. The next two tables show the dimensions of the original data set and of the sample. The sample contains only 1000 observations but keeps all the variables from the original data set.

N=10 obs: Original Data Set
X       LoanNr_ChkDgt  Name
1       1000014003     ABC HOBBYCRAFT
2       1000024006     LANDMARK BAR & GRILLE (THE)
3       1000034009     WHITLOCK DDS, TODD M.
4       1000044001     BIG BUCKS PAWN & JEWELRY, LLC
5       1000054004     ANASTASIA CONFECTIONS, INC.
6       1000084002     B&T SCREW MACHINE COMPANY, INC
7       1000093009     MIDDLE ATLANTIC SPORTS CO INC
8       1000094005     WEAVER PRODUCTS
9       1000104006     TURTLE BEACH INN
10      1000124001     INTEXT BUILDING SYS LLC

N=10 obs: Sample
X       LoanNr_ChkDgt  Name
500     1003855010     JOHN R GRENIER INC
2923    1019285010     C.M.R.I. Export, Inc.
4572    1031435010     David L. Nash DBA David L. Nas
6541    1044505006     Secondary Marketing Strategies
7160    1048794002     MARK A. AND SHANA V. BOURCIER
7207    1049065004     Dancing Cranes Salon & Spa LLC
9743    1067755003     New Look Plus, Inc.
11231   1080654006     AUTECH, INC.
11323   1081254006     MONEY SAVERS, INC.
11547   1082575006     Nanny’s Attic, Inc.

Dimensions: Original Data Set
Observations  Variables
897167        28

Dimensions: Sample
Observations  Variables
1000          28
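
A simple random sample of this kind can be drawn with base R's `sample()`. The sketch below is illustrative only: the toy data frame, its `amount` column, and the seed are assumptions, and sorting the drawn row positions is one way to obtain the ascending X values seen in the sample table.

```r
set.seed(1)   # arbitrary seed so the draw is reproducible

# Toy stand-in for the cleaned data set (the real one has 897,167 rows).
loan <- data.frame(X = 1:5000, amount = round(runif(5000, 1e4, 5e5)))

n <- 1000
# Draw n distinct row positions, then sort them so the sample keeps file order.
rows <- sort(sample(nrow(loan), size = n, replace = FALSE))
srs  <- loan[rows, ]

head(srs$X, 10)   # ascending but not consecutive
dim(loan)         # original dimensions
dim(srs)          # 1000 rows, same number of columns
```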

Systematic Random Sample

Here systematic random sampling was conducted. The aim was a sample of about 1000 observations, with the first observation chosen at random and every later observation a constant distance from the one before. Some tables follow. The first two show the first ten observations in the original data set and in the sample. Notice that the values of the variable X in the sample are in ascending order but are not sequential: the observations start at X = 316261 and were taken every 898 rows from there on. The next two tables show the dimensions of the original data set and of the sample. The sample contains only 649 observations, although it keeps all the variables from the original data set. The shortfall from 1000 is consistent with the random start having been drawn from the whole data set rather than from the first 898 rows, since only the rows between the start and the end of the data set can then be selected.

N=10 obs: Original Data Set
X       LoanNr_ChkDgt  Name
1       1000014003     ABC HOBBYCRAFT
2       1000024006     LANDMARK BAR & GRILLE (THE)
3       1000034009     WHITLOCK DDS, TODD M.
4       1000044001     BIG BUCKS PAWN & JEWELRY, LLC
5       1000054004     ANASTASIA CONFECTIONS, INC.
6       1000084002     B&T SCREW MACHINE COMPANY, INC
7       1000093009     MIDDLE ATLANTIC SPORTS CO INC
8       1000094005     WEAVER PRODUCTS
9       1000104006     TURTLE BEACH INN
10      1000124001     INTEXT BUILDING SYS LLC

N=10 obs: Sample
X       LoanNr_ChkDgt  Name
316261  3182695003     Vasudev Inc.
317159  3189304009     BAYSIDE HILLS PASTRY SHOP
318058  3195185002     DB’S CAFE AND GRILL LLC
318956  3201135006     My Rescue Enterprises, Inc.
319854  3207556000     OMNI DESIGN GROUP INC
320753  3213405001     ACCESS SECURITY, INC.
321654  3219895009     Daniel B. Orellana DBA America
322553  3227425008     DON’S DUGOUT LLC
323451  3234825008     DATTAGE LANDSCAPING, INC
324349  3241916005     York Woods Tree Service LLC

Dimensions: Original Data Set
Observations  Variables
897167        28

Dimensions: Sample
Observations  Variables
649           28
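
The sketch below illustrates how the choice of random start affects the size of a systematic sample. It is not the report's code: the variable names and the seed are assumptions, and the interval is taken as the ceiling of N / n, which gives the 898 quoted above. Only row positions are generated, so no data frame is needed.

```r
set.seed(1)   # arbitrary seed so the draw is reproducible

N <- 897167          # rows in the cleaned data set
n <- 1000            # target sample size
k <- ceiling(N / n)  # sampling interval

# Start drawn from the first k rows: the sample spans the whole data set.
start_first_k <- sample(k, 1)
rows_full     <- seq(from = start_first_k, to = N, by = k)

# Start drawn from all N rows: only rows after the start can be selected.
start_any  <- sample(N, 1)
rows_short <- seq(from = start_any, to = N, by = k)

k
length(rows_full)    # 999 or 1000, depending on the start
length(rows_short)   # anywhere from 1 to 1000
```

Restricting the start to the first k rows keeps the sample size close to the target, whereas an unrestricted start can cut it short; `loan[rows_full, ]` would then extract the sampled rows.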

Stratified Sample

Here stratified sampling was conducted. The aim was a sample of about 1000 observations drawn from groups (strata) within the data set, here the 21 combined NAICS codes, with about 48 observations taken at random from each group. Some tables follow. The first two show the first ten observations in the original data set and in the sample. Notice that the values of the variable X in the sample are in ascending order but are not sequential, and notice the different NAICS codes. The next two tables show the dimensions of the original data set and of the sample. The sample contains 1008 observations (21 groups of 48) and keeps all the variables from the original data set.

N=10 obs: Original Data Set
X       LoanNr_ChkDgt  Name                            NCODEs
1       1000014003     ABC HOBBYCRAFT                  44-45
2       1000024006     LANDMARK BAR & GRILLE (THE)     72
3       1000034009     WHITLOCK DDS, TODD M.           62
4       1000044001     BIG BUCKS PAWN & JEWELRY, LLC   0
5       1000054004     ANASTASIA CONFECTIONS, INC.     0
6       1000084002     B&T SCREW MACHINE COMPANY, INC  31-33
7       1000093009     MIDDLE ATLANTIC SPORTS CO INC   0
8       1000094005     WEAVER PRODUCTS                 81
9       1000104006     TURTLE BEACH INN                72
10      1000124001     INTEXT BUILDING SYS LLC         0

N=10 obs: Sample
X       LoanNr_ChkDgt  Name                            NCODEs
758     1005375001     ACSI, Inc.                      52
765     1005435006     Miami Home Shutters, Inc.       92
1433    1009664006     GOLD COAST BUILDERS             55
2078    1014106010     VINYL WORLD GULFPORT INC        42
2659    1017626008     GREAT SHAPE OF CALIFORNIA       71
3333    1022055001     Bernadette Antonino dba Sanisa  92
5041    1034595007     Sofonias, Inc.                  44-45
5406    1036865009     GARY M. KEEGAN DBA VALLEY HOT   42
5545    1037735007     Peggy S. Farrell dba Peggy Far  52
5800    1039565008     D & T Drilling, Inc.            21

Dimensions: Original Data Set
Observations  Variables
897167        28

Dimensions: Sample
Observations  Variables
1008          28
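
A stratified draw with an equal allocation per group could be sketched as follows. This is an assumption-laden illustration, not the report's code: the toy data frame has 100 rows in each of the 21 combined codes, and the per-group size is taken as the ceiling of 1000 / 21, which gives the 48 quoted above.

```r
set.seed(1)   # arbitrary seed so the draw is reproducible

# Toy stand-in: 21 strata labelled like the combined NAICS codes, 100 rows each.
codes <- c("0", "11", "21", "22", "23", "31-33", "42", "44-45", "48-49", "51",
           "52", "53", "54", "55", "56", "61", "62", "71", "72", "81", "92")
loan <- data.frame(X = seq_len(21 * 100),
                   NCODEs = rep(codes, each = 100),
                   stringsAsFactors = FALSE)

n_target    <- 1000
n_strata    <- length(unique(loan$NCODEs))
per_stratum <- ceiling(n_target / n_strata)   # ceiling(1000 / 21) = 48

# Sample row positions within each stratum, without replacement.
by_stratum <- split(seq_len(nrow(loan)), loan$NCODEs)
picked <- lapply(by_stratum, function(idx) {
  idx[sample.int(length(idx), per_stratum)]
})
rows  <- sort(unlist(picked, use.names = FALSE))
strat <- loan[rows, ]

dim(strat)            # 21 * 48 = 1008 rows
table(strat$NCODEs)   # 48 observations in every stratum
```

Indexing with `sample.int()` rather than calling `sample()` on the positions directly avoids surprises when a stratum has a single row. Equal allocation oversamples the small strata relative to their share of the data, which matters if the sample is later used to estimate overall proportions.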