For this weeks assignment, I am going to perform an exploratory analysis on data that was collected from the U.S Small business administration. The data ranges from the years 1987 to 2014, and each observation is a loan guaranteed the SBA.This large data set contains 27 variables and 899,164 observations. It is split into nine smaller data sets.
First, we are going to load in the nine data sets and combine them into on large one called ‘loan’.
loan01 = read.csv("https://raw.githubusercontent.com/AvaDeSt/bank_data/refs/heads/main/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://raw.githubusercontent.com/AvaDeSt/bank_data/refs/heads/main/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://raw.githubusercontent.com/AvaDeSt/bank_data/refs/heads/main/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://raw.githubusercontent.com/AvaDeSt/bank_data/refs/heads/main/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://raw.githubusercontent.com/AvaDeSt/bank_data/refs/heads/main/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://raw.githubusercontent.com/AvaDeSt/bank_data/refs/heads/main/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://raw.githubusercontent.com/AvaDeSt/bank_data/refs/heads/main/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://raw.githubusercontent.com/AvaDeSt/bank_data/refs/heads/main/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://raw.githubusercontent.com/AvaDeSt/bank_data/refs/heads/main/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
# dim(bankLoan)
#names(bankLoan)
First, I am going to delete all records whose MIS_Status value was missing. This takes the data set down to 897,167 observations.
loans <- na.omit(loan[!is.na(loan$MIS_Status), ])
Now, I am going to stratify the character value ‘State’ into where ever their respective region is located in the United States. This new variable will be called ‘Region’. First, I am going to rename each state with whatever region they are located in the United States and create the stratified variable.
state_to_region <- c(
'AL' = 'Southeast', 'AK' = 'West', 'AZ' = 'West', 'AR' = 'Southeast', 'CA' = 'West',
'CO' = 'West', 'CT' = 'Northeast', 'DC' = 'Northeast', 'DE' = 'Northeast', 'FL' = 'Southeast',
'GA' = 'Southeast', 'HI' = 'West', 'ID' = 'West', 'IL' = 'Midwest', 'IN' = 'Midwest',
'IA' = 'Midwest', 'KS' = 'Midwest', 'KY' = 'Southeast', 'LA' = 'South', 'ME' = 'Northeast',
'MD' = 'Northeast', 'MA' = 'Northeast', 'MI' = 'Midwest', 'MN' = 'Midwest',
'MS' = 'Southeast', 'MO' = 'Midwest', 'MT' = 'West', 'NE' = 'Midwest', 'NV' = 'West',
'NH' = 'Northeast', 'NJ' = 'Northeast', 'NM' = 'West', 'NY' = 'Northeast',
'NC' = 'Southeast', 'ND' = 'Midwest', 'OH' = 'Midwest', 'OK' = 'South', 'OR' = 'West',
'PA' = 'Northeast', 'RI' = 'Northeast', 'SC' = 'Southeast', 'SD' = 'Midwest', 'TN' = 'South',
'TX' = 'South', 'UT' = 'West', 'VT' = 'Northeast', 'VA' = 'South', 'WA' = 'West',
'WV' = 'Southeast', 'WI' = 'Midwest', 'WY' = 'West'
)
loans <- loans %>%
mutate(Region = state_to_region[State])
Now, we can split the data into subsets depending on which region they are located in the United States. There are five regions which are as follows: west, south, Midwest, northeast, and southeast. Now, I will make five separate data sets based on each region.
southeast <- filter(loans, Region == "Southeast")
northeast <- filter(loans, Region == "Northeast")
midwest <- filter(loans, Region == "Midwest")
south <- filter(loans, Region == "South")
west <- filter(loans, Region == "West")
freq.table = table(loans$Region)
kable(t(freq.table))
Midwest | Northeast | South | Southeast | West |
---|---|---|---|---|
202538 | 204001 | 112640 | 116761 | 263069 |
Now, we are going to define clusters using the variable for zip code to use later in a cluster random sample.
clusters <- unique(loans$Zip)
In this section, I will take take random samples from the newly modified data set ‘loans’. I will use four different strategies to take four random samples. This will include a simple random sample, a systematic random sample, a stratified random sample, and a cluster random sample.
First, I will take a simple random sample, I will use a sample size of 4,000.
loans$sampling.frame = 1:length(loans$GrAppv)
# sampling list
# names(study.pop)
# checking the sampling list variable
sampled.list = sample(1:length(loans$GrAppv), 4000)
# sampling the list
SRS.sample = loans[sampled.list,]
# extract the sampling units (observations)
## dimension check
dimension.SRS = dim(SRS.sample)
names(dimension.SRS) = c("Size", "Var.count")
kable(t(dimension.SRS))
Size | Var.count |
---|---|
4000 | 29 |
Next, I will take a stratified random sample which will also have a sample size of 4,000.
jump.size = dim(loans)[1]%/%4000
# find the jump size in the systematic sampling
# jump.size
rand.starting.pt=sample(1:jump.size,1) # find the random starting value
sampling.id = seq(rand.starting.pt, dim(loans)[1], jump.size) # sampling IDs
#length(sampling.id)
sys.sample=loans[sampling.id,]
# extract the sampling units of systematic samples
sys.Sample.dim = dim(sys.sample)
names(sys.Sample.dim) = c("Size", "Var.count")
kable(t(sys.Sample.dim))
Size | Var.count |
---|---|
4013 | 29 |
As we can see, the actual sample we get using this method is 4013, despite the target size being 4000. This is because the jump size involves rounding error and the population is large.
Now, we will take a stratified random sample based on the Region Variable that we created previously. First we will take the SRS from each stratum
freq.table = table(loans$Region)
rel.freq = freq.table/sum(freq.table)
strata.size = round(rel.freq*4000)
strata.names=names(strata.size)
kable(t(strata.size))
Midwest | Northeast | South | Southeast | West |
---|---|---|---|---|
901 | 908 | 501 | 520 | 1170 |
Now, we will take a stratified sample.
strata.sample = loans[1,]
strata.sample$add.id = 1
for (i in 1:length(strata.names)){
ith.strata.names = strata.names[i]
ith.strata.size = strata.size[i]
ith.sampling.id = which(loans$Region==ith.strata.names)
ith.strata = loans[ith.sampling.id,]
ith.strata$add.id = 1:dim(ith.strata)[1]
ith.sampling.id = sample(1:dim(ith.strata)[1], ith.strata.size)
ith.sample =ith.strata[ith.strata$add.id %in%ith.sampling.id,]
strata.sample = rbind(strata.sample, ith.sample)
}
strat.sample.final = strata.sample[-1,]
Finally, we will take a cluster random sample from the cluster identified earlier based on the variable for zip code.
set.seed(123)
sampled_clusters <- sample(clusters, size = 10)
sampled_data <- loans %>% filter(Zip %in% sampled_clusters)
head(sampled_data)
LoanNr_ChkDgt Name City State Zip
1 1025315002 Northeast Collision, Inc. ELMA NY 14059
2 1028384000 PIONEER INDUSTRIAL SALES, LLC KINGSPORT TN 37660
3 1116524005 CHRISTOPHER DAVID'S STUDIO OF ELMA NY 14059
4 1116555010 Altech, Incorporated KINGSPORT TN 37660
5 1179315005 JAMES A CARR DBA GOLFER'S EDGE KINGSPORT TN 37660
6 1187614005 MUSSETTER DISTRIBUTING, INC. AUBURN CA 95602
Bank BankState NAICS ApprovalDate ApprovalFY Term
1 MANUFACTURERS & TRADERS TR CO NY 811121 20-Jul-04 2004 48
2 1ST BK & TR CO VA 0 12-Mar-97 1997 84
3 BANK OF AMERICA NATL ASSOC RI 0 10-Apr-97 1997 60
4 BANK OF AMERICA NATL ASSOC NC 811212 1-Dec-04 2005 84
5 CAPITAL ONE NATL ASSOC VA 451110 2-Feb-05 2005 30
6 MUFG UNION BANK NATL ASSOC CA 422810 1-May-97 1997 120
NoEmp NewExist CreateJob RetainedJob FranchiseCode UrbanRural RevLineCr
1 30 1 0 0 10729 2 Y
2 3 2 0 0 1 0 N
3 4 1 0 0 1 0 N
4 6 1 0 0 1 1 Y
5 5 1 1 5 1 1 N
6 25 1 0 0 1 0 N
LowDoc ChgOffDate DisbursementDate DisbursementGross BalanceGross MIS_Status
1 N 31-Aug-04 $450,890.00 $0.00 P I F
2 Y 30-Apr-97 $50,000.00 $0.00 P I F
3 Y 28-Apr-97 $10,000.00 $0.00 P I F
4 N 31-Dec-04 $50,000.00 $0.00 P I F
5 N 11-Feb-10 28-Feb-05 $50,000.00 $0.00 CHGOFF
6 N 31-Jul-97 $550,000.00 $0.00 P I F
ChgOffPrinGr GrAppv SBA_Appv Region sampling.frame
1 $0.00 $200,000.00 $100,000.00 Northeast 3753
2 $0.00 $50,000.00 $40,000.00 South 4158
3 $0.00 $10,000.00 $8,000.00 Northeast 16031
4 $0.00 $50,000.00 $25,000.00 South 16036
5 $24,695.00 $50,000.00 $25,000.00 South 23882
6 $0.00 $550,000.00 $412,500.00 West 24976
For this assignment we looked at a large data set containing bank loan information from the SBA. After combining all of our data into one large data set, we performed some data cleaning be removing missing values from the MIS_Status variable, and defined our stratified variable which was region. We then took four different random samples, a simple, systematic, stratified, and cluster random sample. For the simple, systematic, and stratified random samples, I used a sample size of 4000. For the cluster random sample, we used the variable for zip code. The samples yielded were all different and had certain advantages. for example, the simple random sample was the most ‘random’ of them all. But the cluster random sample is likely the most practical since it would be the most cost effective in a real experiment. While a stratified random sample is best if you want samples only from certain demographics of a population. The optimal random sample to use depends on the study or experiment being performed.