For this report, I will again be analyzing the Bank Loan Default data.
The goal of this analysis is to demonstrate four types of sampling:
simple random, stratified random, systematic random, and cluster
sampling.
As before, the data is split into 9 separate files. I will combine
all of them into one data set.
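library(dplyr)  # provides %>%, mutate(), and case_when() used below
library(knitr)  # provides kable() used below
loan01 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational01.csv", header = TRUE)[, -1]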
loan02 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
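The nine reads above differ only in the two-digit file number, so the same combination could be written more compactly. The following is a minimal sketch rather than the approach used in this report; `base_url`, `urls`, and `read_all()` are hypothetical names introduced only for illustration.

```r
# Build the nine file URLs from the shared prefix (01 through 09).
base_url <- "https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational"
urls <- sprintf("%s%02d.csv", base_url, 1:9)

# Hypothetical helper: read every file, drop the first column, stack the rows.
read_all <- function(paths) {
  pieces <- lapply(paths, function(p) read.csv(p, header = TRUE)[, -1])
  do.call(rbind, pieces)
}

# Usage (this would download all nine files):
# loan <- read_all(urls)
```

The call is left commented out so that the sketch can be inspected without downloading anything.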
Definitions
Before we continue, we must define some terms and lay some ground
rules.
1) This data set will be treated as a population. This means that we have a source population that is finite and contains 899,164 subjects.
2) Target Population: The loans within all states and the regions those states will be grouped into.
3) Study Population: The 899,154 values within the BankLoan data set once it is cleaned up for analysis.
4) Sampling Frame: The data set “loan,” which contains historical data from the SBA with 27 variables and 899,164 observations.
5) Sampling Unit: Each individual loan represented in the data set.
Data Preparation
Deleting Missing Values
To start, we will remove observations with missing values for either
MIS_Status or State, since I will use those variables as the response
and the stratification variable, respectively.
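# Rows with a missing (empty) loan status, kept for reference
miss <- loan[loan$MIS_Status %in% "", ]
# Keep rows with a non-empty MIS_Status; logical indexing is safe even when nothing matches
newloan <- loan[!(loan$MIS_Status %in% ""), ]
# Then keep rows with a non-empty State
newloan1 <- newloan[!(newloan$State %in% ""), ]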
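To illustrate how filtering on empty strings behaves, here is a small sketch on made-up data. The data frame `toy` and its values are invented for illustration and are not taken from the loan data. It also shows a case that needs care: when no row matches, negative indexing with `which()` drops every row instead of keeping them all.

```r
# Toy data (invented): one record has an empty status, none has an empty state.
toy <- data.frame(
  MIS_Status = c("P I F", "", "CHGOFF"),
  State      = c("PA", "NJ", "NY"),
  stringsAsFactors = FALSE
)

# Keep rows whose status is not the empty string (2 of the 3 rows remain).
kept <- toy[!(toy$MIS_Status %in% ""), ]

# Pitfall: no State is empty, so which() returns integer(0), and
# indexing with -integer(0) selects zero rows rather than all rows.
none_match <- toy[-which(toy$State == ""), ]

# Logical indexing handles the same case correctly and keeps all 3 rows.
all_kept <- toy[!(toy$State %in% ""), ]

c(kept = nrow(kept), none_match = nrow(none_match), all_kept = nrow(all_kept))
```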
Sampling Methods
Simple Random Sampling
The first of our methods is Simple Random Sampling. For this, I will
draw 4,000 loans at random, without replacement, from the population,
so that every loan has the same chance of being selected.
RandomSample <- newloan1[sample(nrow(newloan1), 4000), ]
RandomSample_dim <- dim(RandomSample)
size_var_count <- data.frame(Size = RandomSample_dim[1], Var.count = RandomSample_dim[2])
kable(size_var_count)  # one-row table with columns Size and Var.count
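Because `sample()` uses R's random number generator, each run of the code above selects different rows. A minimal sketch of how a draw can be made repeatable, assuming a toy population and an arbitrary seed value (both invented for illustration):

```r
# Toy population (invented): 1,000 units identified by row number.
population <- data.frame(id = 1:1000)

set.seed(2024)  # arbitrary seed, chosen only so the draw can be repeated
srs_rows <- sample(nrow(population), 50)     # 50 distinct row indices
srs <- population[srs_rows, , drop = FALSE]  # the simple random sample

# Resetting the same seed reproduces exactly the same selection.
set.seed(2024)
srs_rows_again <- sample(nrow(population), 50)
identical(srs_rows, srs_rows_again)
```

By default `sample()` draws without replacement, so no unit can appear twice in the sample.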
Systematic Random Sampling Process:
The second sampling method is systematic sampling. For this method, we
choose a random starting point, and then every k-th data point after it
is included in the sample. Here, I will let R choose the starting point
and use a jump size equal to the population size divided by the target
sample size of 4,000 (using integer division).
jump.size = dim(newloan1)[1] %/% 4000
rand.starting.pt = sample(1:jump.size, 1)
sampling.id = seq(rand.starting.pt, dim(newloan1)[1], jump.size)
sys.sample = newloan1[sampling.id, ]
sys.Sample.dim = dim(sys.sample)
names(sys.Sample.dim) = c("Size", "Var.count")
kable(t(sys.Sample.dim))
This resulted in a sample size of 4006. The sample size can vary
depending on start value and jump size.
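The reason the size varies is that the population size is usually not an exact multiple of the jump size. For a start value `start`, jump size `jump`, and population size `N`, the sequence contains floor((N - start) / jump) + 1 positions. A small sketch with invented numbers (`N`, `target_n`, and `sizes` are illustrative names, not values from the loan data):

```r
N <- 1003                # toy population size (invented)
target_n <- 100          # intended sample size
jump <- N %/% target_n   # integer division gives a jump size of 10

# Sample size obtained for every possible starting point 1..jump.
sizes <- sapply(1:jump, function(start) length(seq(start, N, by = jump)))
sizes
```

With these numbers, the three leftover units (1003 - 10 * 100) give one extra observation whenever the starting point is 1, 2, or 3, so the sample has 101 units for those starts and 100 otherwise.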
Stratified Random Sampling Process:
The third sampling method is stratified sampling. For this method,
observations are randomly chosen from each stratum, that is, from each
group defined by a common characteristic. In our case, we will organize
the observations by geographical region using state, and then we will
sample from each of the 5 regions that will be created.
Grouping by Region
Grouping all States by geographical regions:
newloan1 <- newloan1 %>%
  mutate(State = case_when(
    State %in% c("AL", "AR", "GA", "KY", "LA", "MS", "SC", "TN", "WV", "OK") ~ "SouthWest",
    State %in% c("AK", "AZ", "CA", "CO", "HI", "ID", "MT", "NV", "NM", "OR", "UT", "WA", "WY") ~ "West",
    State %in% c("CT", "DE", "MA", "MD", "ME", "NH", "NJ", "NY", "PA", "RI", "VT") ~ "Northeast",
    State %in% c("IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI") ~ "Midwest",
    State %in% c("FL", "NC", "TX", "VA", "DC") ~ "Southeast",
    TRUE ~ as.character(State)
  ))
Next, we can look at a frequency table for each stratum.
freq.table = table(newloan1$State) # frequency table of regions (State now holds the region label)
rel.freq = freq.table/sum(freq.table) # relative frequency
strata.size = round(rel.freq*4000) # strata size allocation
strata.names=names(strata.size)
Next, we will take a random sample of about 4,000 observations in total, drawing from each stratum in proportion to its size.
strata.sample = newloan1[1,]   # placeholder row so rbind() has matching columns
strata.sample$add.id = 1       # add the temporary ID column that the loop also adds
for (i in 1:length(strata.names)){
  ith.strata.names = strata.names[i]   # name of the i-th stratum (region)
  ith.strata.size = strata.size[i]     # allocated sample size for this stratum
  # Extract the i-th stratum from the population
  ith.sampling.id = which(newloan1$State==ith.strata.names)
  ith.strata = newloan1[ith.sampling.id,]
  ith.strata$add.id = 1:dim(ith.strata)[1]   # within-stratum sampling frame
  # Draw random IDs without replacement from the stratum
  ith.sampling.id = sample(1:dim(ith.strata)[1], ith.strata.size)
  # Keep the rows whose ID was drawn (note the %in% operator)
  ith.sample = ith.strata[ith.strata$add.id %in% ith.sampling.id,]
  strata.sample = rbind(strata.sample, ith.sample)   # stack the stratum samples
}
strat.sample.final = strata.sample[-1,]   # drop the placeholder first row
strat.sample.dim = dim(strat.sample.final)   # Var.count includes the temporary add.id column
names(strat.sample.dim) = c("Size", "Var.count")
kable(t(strat.sample.dim))
This method yielded a sample size of 3999. The total differs slightly from 4,000 because each stratum's allocation is rounded separately.
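The effect of rounding the proportional allocation can be seen in a small sketch. The toy population, the region labels A, B, and C, and the target of 10 are all invented for illustration; the draw uses `lapply()` instead of the loop above, which is an alternative formulation and not the code used in this report.

```r
# Toy population (invented): three strata of nearly equal size.
toy <- data.frame(
  id     = 1:1000,
  region = rep(c("A", "B", "C"), times = c(333, 333, 334))
)

target_n <- 10
counts <- table(toy$region)

# Proportional allocation: each share of 3.33 or 3.34 rounds to 3.
alloc <- round(as.vector(counts) / nrow(toy) * target_n)
names(alloc) <- names(counts)
sum(alloc)  # the rounded allocations add up to 9 rather than 10

# Draw the allocated number of rows within each stratum, then stack them.
set.seed(1)  # arbitrary seed for repeatability
pieces <- lapply(names(alloc), function(g) {
  rows <- which(toy$region == g)
  toy[rows[sample.int(length(rows), alloc[[g]])], ]
})
strat_sample <- do.call(rbind, pieces)
table(strat_sample$region)
```

Using `sample.int()` on the number of rows avoids the special case in which `sample()` treats a single number as the upper end of a range.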
Cluster Random Sampling Process:
The last of our sampling methods is cluster sampling. With this method,
observations are again organized into groups via a common
characteristic. However, unlike with stratified sampling, where every
group is sampled, here a number of groups are chosen at random and
every observation in the chosen groups enters the sample. For this
report, I will group the observations by unique zip code, and then 20
zip codes will be chosen at random.
Defining Clusters:
Clusters are defined based on unique zip codes:
# Assumed definition: the original chunk defining `clusters` is not shown
clusters <- unique(newloan1$Zip)
Now, we sample based on zip code:
# Take a cluster sample using zip codes
selected_clusters <- sample(clusters, 20) # Adjust the number of clusters as needed
ClusterSample <- newloan1[newloan1$Zip %in% selected_clusters, ]
# Print the size and variable count of the cluster sample
ClusterSample_dim <- dim(ClusterSample)
size_var_count_cluster <- data.frame(Size = ClusterSample_dim[1], Var.count = ClusterSample_dim[2])
kable(size_var_count_cluster)  # one-row table with columns Size and Var.count
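This method yielded a sample of 664 loans from the 20 selected clusters. Unlike the other methods, the sample size is not fixed in advance, because zip codes contain different numbers of loans.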
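A small sketch on made-up data shows the difference between the number of clusters selected and the number of observations obtained. The zip codes and cluster sizes below are invented for illustration.

```r
# Toy population (invented): 12 loans in 4 zip codes of unequal size.
toy <- data.frame(
  id  = 1:12,
  Zip = rep(c(10001, 10002, 10003, 10004), times = c(1, 2, 4, 5))
)

zip_codes <- unique(toy$Zip)     # the clusters
set.seed(7)                      # arbitrary seed for repeatability
chosen <- sample(zip_codes, 2)   # select 2 whole clusters at random
cluster_sample <- toy[toy$Zip %in% chosen, ]

length(chosen)         # number of clusters selected (fixed in advance)
nrow(cluster_sample)   # number of loans (depends on which clusters were drawn)
```

In this toy population the sample can contain anywhere from 3 to 9 loans, depending on which two zip codes are drawn.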
Summary and Conclusion
I analyzed bank loan default data. The goal of this report was to
demonstrate how simple, systematic, stratified, and cluster sampling
work. For simple random sampling, I sampled 4000 units from the
population. For systematic random sampling, I let R choose the random
starting point and used a jump size equal to the population size
divided by 4,000 (integer division), which yielded a sample size of
4006. For stratified random sampling, observations were grouped based
on geographical regions and then each group was sampled in proportion
to its size, which yielded a sample size of 3999. For cluster sampling,
clusters were formed via zip codes and then 20 clusters were chosen at
random. This yielded a sample of 664 loans.
