---
title: "Implementing Random Sampling Plans"
author: "Ian VanWright"
date: "05/09/2024"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 6
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: true
    theme: readable
    fig_height: 4
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document:
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 5
    fig_height: 4
---

```{=html}
<style type="text/css">
h1.title {
  font-size: 20px;
  text-align: center;
}
h4.author { 
    font-size: 18px;
    text-align: center;
}
h4.date { 
  font-size: 18px;
  text-align: center;
}
h1 {
    font-size: 22px;
    text-align: center;
}
h2 {
    font-size: 18px;
    text-align: left;
}

div#TOC li {
    list-style:none;
}
</style>
```
```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("lessR")) {
   install.packages("lessR")
   library(lessR)
}
library(dplyr)
library(kableExtra)
knitr::opts_chunk$set(echo = TRUE,       
                      warning = FALSE,   
                      results = TRUE,   
                      message = FALSE,
                      comment = NA)
```


For this report, I will again be analyzing the Bank Loan Default Data. The goal of this analysis is to demonstrate four types of sampling: simple random, systematic, stratified random, and cluster sampling.
 
As before, the data is split into 9 separate files. I will combine all of them into 1 data set.
```{r}
loan01 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
```
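
The nine reads above follow one pattern, so they could also be written as a loop. The sketch below is one possible way to do that, not the code used for this report: `combine_parts()`, `loan_urls`, and the `demo_*` objects are hypothetical names introduced for illustration, and the helper is exercised on two small temporary files rather than on the SBA files.

```{r}
# Hypothetical helper: read each part, drop its first (row index) column, stack the parts
combine_parts <- function(paths) {
  parts <- lapply(paths, function(p) read.csv(p, header = TRUE)[, -1])
  do.call(rbind, parts)
}

# The nine file URLs, generated from the shared naming pattern
loan_urls <- sprintf(
  "https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational%02d.csv", 1:9
)

# Offline check on two temporary files; write.csv() adds a row-name column,
# which plays the role of the index column dropped by [, -1]
demo_paths <- file.path(tempdir(), c("demo_part1.csv", "demo_part2.csv"))
write.csv(data.frame(id = 1:2, x = c(10, 20)), demo_paths[1])
write.csv(data.frame(id = 3:4, x = c(30, 40)), demo_paths[2])
demo_combined <- combine_parts(demo_paths)
dim(demo_combined)
```

Under the assumption that all nine files share the same columns, `combine_parts(loan_urls)` should give the same result as the explicit `rbind()` call.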


# Definitions
Before we continue, we must define some terms and lay some ground rules.

1) Source Population: This data set will be treated as a population. This means that we have a source population that is finite and contains 899,164 subjects. 
2) Target Population: The loans within all states and the regions those states will be grouped into. 
3) Study Population: The 899,154 observations within the BankLoan data set once it is cleaned up for analysis.
4) Sampling Frame: The sampling frame is the data set "loan," which contains historical data from the SBA with 27 variables and 899,164 observations.
5) Sampling Unit: Each individual loan represented in the data set. 


# Data Preparation

## Deleting Missing values

To start, we will remove observations with missing values in MIS_Status or State, since I will use MIS_Status as the response variable and State as the stratification variable.
```{r}
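# Keep a copy of the records with a blank MIS_Status for reference
miss <- loan[which(loan$MIS_Status == ""), ]
# Drop blank values with a logical filter; unlike -which(...), this also works when nothing matches
newloan <- loan[!(loan$MIS_Status %in% ""), ]
newloan1 <- newloan[!(newloan$State %in% ""), ]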
```
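
One reason to be careful with the `-which(...)` idiom when deleting rows: if no row meets the condition, `which()` returns an empty vector and the negative index selects nothing at all. A minimal sketch on a made-up data frame (the `demo_*` names and values are hypothetical) shows the difference from a logical filter:

```{r}
# Toy data with no blank status values
demo_df <- data.frame(status = c("P I F", "CHGOFF", "P I F"),
                      state  = c("PA", "NJ", "DE"))

demo_no_match <- which(demo_df$status == "")         # integer(0): nothing to drop
demo_dropped_all <- demo_df[-demo_no_match, ]        # negative empty index returns 0 rows
demo_kept_all <- demo_df[!(demo_df$status %in% ""), ]  # logical filter keeps all 3 rows

nrow(demo_dropped_all)
nrow(demo_kept_all)
```

With the loan data this only matters if a column happens to contain no blank values, but the logical form behaves the same either way.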

# Sampling Methods
## Simple Random Sampling
The first of our methods is simple random sampling. For this, I will draw 4,000 loans at random, without replacement, so that every loan in the population has the same chance of being selected. 
```{r}
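# Draw 4,000 row indices without replacement and subset the population
RandomSample <- newloan1[sample(nrow(newloan1), 4000), ]
RandomSample_dim <- dim(RandomSample)
size_var_count <- data.frame(Size = RandomSample_dim[1], Var.count = RandomSample_dim[2])

# One-row summary table: sample size and number of variables
kable(size_var_count, col.names = c("Size", "Var.count"))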
```
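
Because `sample()` draws different rows on every run, the sample changes each time the report is knitted. A minimal sketch on a made-up population shows how fixing the seed makes a simple random sample reproducible. The `demo_*` objects, the helper `draw_srs()`, and the seed values are assumptions introduced for illustration only.

```{r}
# Toy population of 1,000 loans with a simulated default flag
set.seed(2024)                        # arbitrary seed, used only in this sketch
demo_pop <- data.frame(loan_id = 1:1000,
                       default = rbinom(1000, size = 1, prob = 0.2))

# Hypothetical helper: simple random sample of n rows, reproducible via the seed
draw_srs <- function(pop, n, seed) {
  set.seed(seed)
  pop[sample(nrow(pop), n), ]
}

demo_srs_a <- draw_srs(demo_pop, 50, seed = 1)
demo_srs_b <- draw_srs(demo_pop, 50, seed = 1)
identical(demo_srs_a, demo_srs_b)     # same seed, same sample
anyDuplicated(demo_srs_a$loan_id)     # 0 means no loan was selected twice
```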

## Systematic Random Sampling Process:
The second sampling method is systematic sampling. For this method, we choose a starting point, and then every nth data point after it is included in the sample. I will let R choose the starting point at random and set the jump size to the population size divided by 4,000 (rounded down), so that the sample contains roughly 4,000 loans.
```{r}
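# Jump size: population size divided by the target sample size, rounded down
jump.size = dim(newloan1)[1] %/% 4000
# Random starting point within the first interval
rand.starting.pt = sample(1:jump.size, 1)
# Select every jump.size-th row from the starting point to the end
sampling.id = seq(rand.starting.pt, dim(newloan1)[1], jump.size)
sys.sample = newloan1[sampling.id, ]

sys.Sample.dim = dim(sys.sample)
names(sys.Sample.dim) = c("Size", "Var.count")
kable(t(sys.Sample.dim))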
```

This resulted in a sample size of 4006. The size exceeds 4,000 because the jump size is rounded down to a whole number, and it can differ by one depending on the random starting point.
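
The arithmetic behind the sample size can be sketched on a made-up population. With population size N, jump size k, and starting point s, the selected rows are s, s + k, s + 2k, and so on, which gives floor((N - s) / k) + 1 rows. The values below (`demo_N`, a target of about 100) and the helper `sys_size()` are hypothetical and not taken from the loan data.

```{r}
# Size of a systematic sample: start, start + k, start + 2k, ... up to N
sys_size <- function(N, k, start) length(seq(start, N, by = k))

demo_N <- 1003                       # made-up population size
demo_k <- demo_N %/% 100             # jump size for a target of about 100
demo_sizes <- sapply(seq_len(demo_k), function(s) sys_size(demo_N, demo_k, s))
table(demo_sizes)                    # sizes over all possible starting points

# The same sizes from the closed form floor((N - start) / k) + 1
demo_closed_form <- (demo_N - seq_len(demo_k)) %/% demo_k + 1
all(demo_sizes == demo_closed_form)
```

In this toy case the size is either 100 or 101 depending on the starting point, which mirrors why the loan sample is slightly larger than 4,000.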

## Stratified Random Sampling Process:
The third sampling method is stratified sampling. For this method, the population is divided into strata, which are groups defined by a common characteristic, and observations are randomly chosen from each stratum. In our case, we will organize the observations by geographical region using State, and then we will sample from each of the 5 regions that will be created.


### Grouping by Region
Grouping all states into geographical regions:
```{r}
newloan1 <- newloan1 %>%
  mutate(State = case_when(
    State %in% c("AL", "AR", "GA", "KY", "LA", "MS", "SC", "TN", "WV", "OK") ~ "SouthWest",
    State %in% c("AK", "AZ", "CA", "CO", "HI", "ID", "MT", "NV", "NM", "OR", "UT", "WA", "WY") ~ "West",
    State %in% c("CT", "DE", "MA", "MD", "ME", "NH", "NJ", "NY", "PA", "RI", "VT") ~ "Northeast",
    State %in% c("IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI") ~ "Midwest",
    State %in% c("FL", "NC", "TX", "VA", "DC") ~ "Southeast",
    TRUE ~ as.character(State)
  ))
```
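
Because the final `TRUE ~ as.character(State)` branch passes unmatched codes through unchanged, any code missing from the five lists would silently become its own stratum. A small check, sketched here on a made-up vector, can flag such values. The `demo_*` names and the code `"PR"` are hypothetical and used only for illustration.

```{r}
# The five region labels created by the recoding step
demo_region_names <- c("Midwest", "Northeast", "SouthWest", "Southeast", "West")

# Made-up recoded values, including one code that matched none of the lists
demo_recoded <- c("West", "Midwest", "PR", "Northeast", "West")

# Anything that is not a region label was passed through by the fallback branch
demo_unmapped <- setdiff(unique(demo_recoded), demo_region_names)
demo_unmapped
```

Applied to the report's data, the same idea would compare `unique(newloan1$State)` against the five labels; an empty result would mean every state was assigned to a region.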

Next, we build a frequency table of the strata and allocate the 4,000 sampled loans to the strata in proportion to their size.
```{r}
freq.table = table(newloan1$State)      # frequency table of the regions
rel.freq = freq.table/sum(freq.table)   # relative frequency 
strata.size = round(rel.freq*4000)      # strata size allocation
strata.names = names(strata.size)       # region labels
```
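
Rounding each stratum's share separately means the allocated sizes do not have to add up to the target. The sketch below illustrates this on made-up stratum counts and then shows one common remedy, a largest-remainder adjustment. That adjustment is an alternative named here for illustration and is not what this report uses; all `demo_*` names and numbers are hypothetical.

```{r}
demo_counts <- c(A = 143, B = 143, C = 143, D = 571)   # made-up stratum sizes
demo_target <- 10                                      # made-up total sample size

# Proportional allocation with independent rounding, as in round(rel.freq * n)
demo_exact <- demo_counts / sum(demo_counts) * demo_target
demo_rounded <- round(demo_exact)
sum(demo_rounded)                                      # falls short of the target

# Largest-remainder adjustment: round down, then give the leftover units
# to the strata with the largest fractional parts
demo_adjusted <- floor(demo_exact)
demo_shortfall <- demo_target - sum(demo_adjusted)
demo_top_up <- order(demo_exact - demo_adjusted, decreasing = TRUE)[seq_len(demo_shortfall)]
demo_adjusted[demo_top_up] <- demo_adjusted[demo_top_up] + 1
sum(demo_adjusted)                                     # matches the target
```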

Next, we will take a random sample of about 4,000 loans by drawing the allocated number of loans from each stratum.
```{r}
strata.sample = newloan1[1, ]   # placeholder row so rbind() has matching columns
strata.sample$add.id = 1        # the loop adds a temporary ID column, so add it here too

for (i in seq_along(strata.names)) {
   ith.strata.names = strata.names[i]   # name of the i-th stratum
   ith.strata.size = strata.size[i]     # allocated sample size for the stratum
   # Subset the population to the i-th stratum
   ith.sampling.id = which(newloan1$State == ith.strata.names)
   ith.strata = newloan1[ith.sampling.id, ]
   ith.strata$add.id = 1:dim(ith.strata)[1]   # sampling frame within the stratum
   # Draw random IDs within the stratum, without replacement
   ith.sampling.id = sample(1:dim(ith.strata)[1], ith.strata.size)
   # Keep the rows whose ID was drawn (note the %in% operator)
   ith.sample = ith.strata[ith.strata$add.id %in% ith.sampling.id, ]
   strata.sample = rbind(strata.sample, ith.sample)   # stack the stratum samples
}
strat.sample.final = strata.sample[-1, ]   # drop the placeholder first row

strat.sample.dim = dim(strat.sample.final)
names(strat.sample.dim) = c("Size", "Var.count")
kable(t(strat.sample.dim))
```

This method yielded a sample size of 3999, one short of 4,000 because each stratum's allocation is rounded to a whole number.
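
The same stratified draw can be written more compactly with `split()` and `lapply()`. The following is a sketch on a made-up population under assumed allocations, not a replacement for the loop used in this report; every `demo_*` name and number is hypothetical.

```{r}
set.seed(2024)                                   # arbitrary seed for this sketch
# Toy population: 100 units in four strata of unequal size
demo_strata_pop <- data.frame(
  id     = 1:100,
  region = rep(c("A", "B", "C", "D"), times = c(40, 30, 20, 10))
)
# Assumed allocation: sample size per stratum (proportional, 10 in total)
demo_n_per_region <- c(A = 4, B = 3, C = 2, D = 1)

# Split into strata, sample within each stratum, then stack the pieces
demo_by_region <- split(demo_strata_pop, demo_strata_pop$region)
demo_strat_sample <- do.call(rbind, lapply(names(demo_by_region), function(r) {
  stratum <- demo_by_region[[r]]
  stratum[sample(nrow(stratum), demo_n_per_region[[r]]), ]
}))

table(demo_strat_sample$region)                  # counts per stratum match the allocation
```

This version avoids the placeholder row and the temporary ID column, at the cost of being less explicit about each step.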

## Cluster Random Sampling Process:
The last of our sampling methods is cluster sampling. With this method, observations are again organized into groups via a common characteristic. However, unlike stratified sampling, which draws observations from every group, cluster sampling selects entire groups at random and keeps every observation in the selected groups. For this report, I will group the observations by unique zip code, and then 20 zip codes will be chosen at random.

### Defining Clusters: 

Clusters are defined by the unique zip codes in the data:
```{r, echo=FALSE, warning=FALSE}
clusters <- unique(newloan1$Zip)

```

Now, we sample based on zip code:
```{r}

# Take a cluster sample using zip codes
selected_clusters <- sample(clusters, 20)  # Adjust the number of clusters as needed
ClusterSample <- newloan1[newloan1$Zip %in% selected_clusters, ]


# Print the size and variable count of the cluster sample
ClusterSample_dim <- dim(ClusterSample)
size_var_count_cluster <- data.frame(Size = ClusterSample_dim[1], Var.count = ClusterSample_dim[2])
kable(t(size_var_count_cluster), col.names = c("Size", "Var.count"))
```
In this run, the 20 selected zip-code clusters yielded a sample of 664 loans.
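
Since whole clusters are selected, the number of sampled units depends on which clusters happen to be drawn. A minimal sketch on a made-up population illustrates this; the `demo_*` names, the zip codes, and the cluster sizes are all hypothetical.

```{r}
set.seed(2024)                                   # arbitrary seed for this sketch
# Toy population: 60 units spread over 12 zip codes of size 2, 5, or 8
demo_zip_pop <- data.frame(
  id  = 1:60,
  zip = rep(sprintf("19%03d", 1:12), times = rep(c(2, 5, 8), 4))
)

# Select 3 clusters at random and keep every unit inside them
demo_chosen <- sample(unique(demo_zip_pop$zip), 3)
demo_cluster_sample <- demo_zip_pop[demo_zip_pop$zip %in% demo_chosen, ]

table(demo_cluster_sample$zip)                   # each chosen cluster appears in full
nrow(demo_cluster_sample)                        # total size varies with the clusters drawn
```

Here the sample can contain anywhere from 6 to 24 units, which is why the size of a cluster sample is not fixed in advance the way it is for simple random sampling.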

# Summary and Conclusion
I analyzed bank loan default data. The goal of this report was to demonstrate how simple random, systematic, stratified, and cluster sampling work. For simple random sampling, I sampled 4,000 units from the population. For systematic sampling, I let R choose the random starting point and used a jump size equal to the population size divided by 4,000 (rounded down), which yielded a sample size of 4006. For stratified random sampling, observations were grouped by geographical region and each region was sampled in proportion to its size, which yielded a sample size of 3999. For cluster sampling, clusters were defined by zip code and 20 clusters were chosen at random, which yielded a sample of 664 loans.


