1 Introduction

In this note, I will introduce the steps for taking random samples from the study population. The Bank load data set is treated as a population. We will use this data set as a population to implement various sampling plans.

The original data set was split into 9 subsets that are stored on GitHub. We first load these data sets to R and then combine them as a single data set.

# Read in the dataset
loan01 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
# dim(bankLoan)
#names(bankLoan)
## Remove NAs from MIS_status
library(tidyverse)
bankLoan <- loan
##convert dollars to numeric
##Variables DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, SBA_Appv
colnames(bankLoan)
 [1] "LoanNr_ChkDgt"     "Name"              "City"             
 [4] "State"             "Zip"               "Bank"             
 [7] "BankState"         "NAICS"             "ApprovalDate"     
[10] "ApprovalFY"        "Term"              "NoEmp"            
[13] "NewExist"          "CreateJob"         "RetainedJob"      
[16] "FranchiseCode"     "UrbanRural"        "RevLineCr"        
[19] "LowDoc"            "ChgOffDate"        "DisbursementDate" 
[22] "DisbursementGross" "BalanceGross"      "MIS_Status"       
[25] "ChgOffPrinGr"      "GrAppv"            "SBA_Appv"         
bankLoan$DisbursementGross <- as.numeric(gsub('[$,]', '', bankLoan$DisbursementGross))
bankLoan$BalanceGross <- as.numeric(gsub('[$,]', '', bankLoan$BalanceGross))
bankLoan$ChgOffPrinGr <- as.numeric(gsub('[$,]', '', bankLoan$ChgOffPrinGr))
bankLoan$GrAppv <- as.numeric(gsub('[$,]', '', bankLoan$GrAppv))
bankLoan$SBA_Appv <- as.numeric(gsub('[$,]', '', bankLoan$SBA_Appv))

1.1 Stratifcation Variable

Here is the distribution of original Number of Employees variable. This will help us make our groupings for this variable.

hist(bankLoan$NoEmp)

1.2 Combining Categories

We now combine the number of employees the business has into categories. We created a new variable NoEmpBin based off of NoEmp that grouped it into five different categories. NoEmpBin will be our operational stratification variable.

# Re-group the number of business employees
library(dplyr)
bankLoan <- bankLoan %>% 
  mutate( NoEmpBin = case_when(NoEmp <= 5 ~ '<=5',
                                     NoEmp > 5 & NoEmp <= 25 ~ '6-25',
                                     NoEmp > 25 & NoEmp <=50 ~ '26-50',
                                     NoEmp > 50 & NoEmp <= 100 ~ '51-100',
                                     NoEmp > 100  ~ '101+')) 

1.3 Loan Default Rates By Number of Employees

We now find the loan default rates by the number of employees defined by the stratification variable NoEmpBin The loan default status can be defined by the variable MIS_Status.

x.table = table(bankLoan$NoEmpBin, bankLoan$MIS_Status)
no.lab = x.table[,1]  # first column consists of unknown default label
default = x.table[,2]
no.default = x.table[,3]
default.rate = round(100*default/(default+no.default),1)
default.status.rate = cbind(no.lab = no.lab, 
                          default = default, 
                          no.default = no.default,
                          default.rate=default.rate)
kable(default.status.rate)
no.lab default no.default default.rate
<=5 1568 109463 412788 21.0
101+ 36 419 6998 5.6
26-50 29 4619 48068 8.8
51-100 26 1301 18571 6.5
6-25 338 41756 253184 14.2

1.4 Study Population

Based on the above frequency distribution of the modified number of employees bins.

study.pop = bankLoan
kable(t(table(bankLoan$NoEmpBin))) # Checking correctness operation
<=5 101+ 26-50 51-100 6-25
523819 7453 52716 19898 295278

So we have defined our study population!

2 Sampling Plans

In this section, we are implementing three sampling plans. In each sampling plan, we select 4000 observations in the corresponding samples.

2.1 Simple Random Sampling

We define a sampling list and add it to the study population.

study.pop$sampling.frame = 1:length(study.pop$GrAppv)   
# sampling list
# names(study.pop)                                     
# checking the sampling list variable
sampled.list = sample(1:length(study.pop$GrAppv), 4000) 
# sampling the list
SRS.sample = study.pop[sampled.list,]                  
# extract the sampling units (observations)
## dimension check
dimension.SRS = dim(SRS.sample)
names(dimension.SRS) = c("Size", "Var.count")
kable(t(dimension.SRS))   # checking the sample size
Size Var.count
4000 29

2.2 Systematic sampling

jump.size = dim(study.pop)[1]%/%4000  
# find the jump size in the systematic sampling
# jump.size
rand.starting.pt=sample(2:jump.size,1) # find the random starting value
sampling.id = seq(rand.starting.pt, dim(study.pop)[1], jump.size)  # sampling IDs
#length(sampling.id)
sys.sample=study.pop[sampling.id,]    
# extract the sampling units of systematic samples
sys.Sample.dim = dim(sys.sample)
names(sys.Sample.dim) = c("Size", "Var.count")
kable(t(sys.Sample.dim))
Size Var.count
4014 29

Because the jump size involves rounding error and the population is large, the actual systematic sample size is slightly different from the target size. In this report, I used the integer part of the actual jump size. The actual systematic sampling size is slightly bigger than the target size. We can take away some records random from the systematic sample to make the size to be equal to the target size.

2.3 Stratified Sampling

We take an SRS from each stratum. The sample size should be approximately proportional to the size of the corresponding stratum.

First, we calculate the SRS size for each stratum and then take the SRS from the corresponding stratum.

freq.table = table(study.pop$NoEmpBin)  # frequency table of strNAICS
rel.freq = freq.table/sum(freq.table)   # relative frequency 
strata.size = round(rel.freq*4000)      # strata size allocation
strata.names=names(strata.size)         # extract strNAICS names for accuracy checking
kable(t(strata.size))  # make a nice-looking table using kable().
<=5 101+ 26-50 51-100 6-25
2330 33 235 89 1314

In the following code chunk, we take stratified samples.

strata.sample = study.pop[1,]    # create a reference data frame
strata.sample$add.id = 1   # add a temporary ID to because in the loop
                           # i =2 testing a single iteration
for (i in 1:length(strata.names)){
   ith.strata.names = strata.names[i]   # extract data frame names
   ith.strata.size = strata.size[i]     # allocated stratum size
   # The following code identifies observations to be selected
   ith.sampling.id = which(study.pop$NoEmpBin==ith.strata.names) 
   ith.strata = study.pop[ith.sampling.id,]  # i-th stratified population
   ith.strata$add.id = 1:dim(ith.strata)[1]  # add sampling list/frame
   # The following code generates a subset of random ID
   ith.sampling.id = sample(1:dim(ith.strata)[1], ith.strata.size) 
   ## Create a selection status -- pay attention to the operator: %in% 
   ith.sample =ith.strata[ith.strata$add.id %in%ith.sampling.id,]
   ## dim(ith.sample)         $ check the sample
   strata.sample = rbind(strata.sample, ith.sample)  # stack all data frame!
 }
 # dim(strata.sample)
 strat.sample.final = strata.sample[-1,]  # drop the temporary stratum ID
 #kable(head(strat.sample.final))         # accuracy check!

2.4 Cluster Sampling

Here we will do a cluster sampling method based off of zip codes. We will cluster the data by zip codes and then take a random sample of zip codes. Our sample size is roughly around 4000.

#unique zip codes
unizip <- unique(study.pop$Zip)

#random sample of zip codes
# checking the sampling list variable
sampled.zip = sample(unizip, 170) 

# join the data to the zip codes
cluster <- study.pop[study.pop$Zip %in% sampled.zip, ]

sys.clsuter = dim(cluster)
names(sys.clsuter) = c("Size", "Var.count")
kable(t(sys.clsuter))
Size Var.count
4598 29
---
title: "Implementing Ramdom Sampling Plans "
author: "Gianna LaFrance"
date: "  "
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 6
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: true
    theme: readable
    fig_height: 4
---

```{=html}
<style type="text/css">
h1.title {
  font-size: 20px;
  text-align: center;
}
h4.author { 
    font-size: 18px;
    text-align: center;
}
h4.date { 
  font-size: 18px;
  text-align: center;
}
h1 {
    font-size: 22px;
    text-align: center;
}
h2 {
    font-size: 18px;
    text-align: left;
}

div#TOC li {
    list-style:none;
}
</style>
```
```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("lessR")) {
   install.packages("lessR")
   library(lessR)
}

knitr::opts_chunk$set(echo = TRUE,       
                      warnings = FALSE,   
                      results = TRUE,   
                      message = FALSE,
                      comment = NA)
```


# Introduction


In this note, I will introduce the steps for taking random samples from the study population. The Bank load data set is treated as a population. We will use this data set as a population to implement various sampling plans.

The original data set was split into 9 subsets that are stored on GitHub. We first load these data sets to R and then combine them as a single data set.

```{r dataset}
# Read in the dataset
loan01 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
# dim(bankLoan)
#names(bankLoan)
```

```{r}
## Remove NAs from MIS_status
library(tidyverse)
bankLoan <- loan
```

```{r}
##convert dollars to numeric
##Variables DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, SBA_Appv
colnames(bankLoan)

bankLoan$DisbursementGross <- as.numeric(gsub('[$,]', '', bankLoan$DisbursementGross))
bankLoan$BalanceGross <- as.numeric(gsub('[$,]', '', bankLoan$BalanceGross))
bankLoan$ChgOffPrinGr <- as.numeric(gsub('[$,]', '', bankLoan$ChgOffPrinGr))
bankLoan$GrAppv <- as.numeric(gsub('[$,]', '', bankLoan$GrAppv))
bankLoan$SBA_Appv <- as.numeric(gsub('[$,]', '', bankLoan$SBA_Appv))

```


## Stratifcation Variable


Here is the distribution of original Number of Employees variable. This will help us make our groupings for this variable.

```{r data-size}
hist(bankLoan$NoEmp)
```


## Combining Categories

We now combine the number of employees the business has into categories. We created a new variable NoEmpBin based off of NoEmp that grouped it into five different categories. NoEmpBin will be our operational stratification variable.


```{r}
# Re-group the number of business employees
library(dplyr)
bankLoan <- bankLoan %>% 
  mutate( NoEmpBin = case_when(NoEmp <= 5 ~ '<=5',
                                     NoEmp > 5 & NoEmp <= 25 ~ '6-25',
                                     NoEmp > 25 & NoEmp <=50 ~ '26-50',
                                     NoEmp > 50 & NoEmp <= 100 ~ '51-100',
                                     NoEmp > 100  ~ '101+')) 

```


## Loan Default Rates By Number of Employees

We now find the loan default rates by the number of employees defined by the stratification variable NoEmpBin The loan default status can be defined by the variable MIS_Status.

```{r}
x.table = table(bankLoan$NoEmpBin, bankLoan$MIS_Status)
no.lab = x.table[,1]  # first column consists of unknown default label
default = x.table[,2]
no.default = x.table[,3]
default.rate = round(100*default/(default+no.default),1)
default.status.rate = cbind(no.lab = no.lab, 
                          default = default, 
                          no.default = no.default,
                          default.rate=default.rate)
kable(default.status.rate)
```

## Study Population

Based on the above frequency distribution of the modified number of employees bins.

```{r}
study.pop = bankLoan
kable(t(table(bankLoan$NoEmpBin))) # Checking correctness operation
```

So we have defined our study population!

# Sampling Plans

In this section, we are implementing three sampling plans. In each sampling plan, we select 4000 observations in the corresponding samples.

## Simple Random Sampling

We define a sampling list and add it to the study population.

```{r}
study.pop$sampling.frame = 1:length(study.pop$GrAppv)   
# sampling list
# names(study.pop)                                     
# checking the sampling list variable
sampled.list = sample(1:length(study.pop$GrAppv), 4000) 
# sampling the list
SRS.sample = study.pop[sampled.list,]                  
# extract the sampling units (observations)
## dimension check
dimension.SRS = dim(SRS.sample)
names(dimension.SRS) = c("Size", "Var.count")
kable(t(dimension.SRS))   # checking the sample size
```

## Systematic sampling

```{r}
jump.size = dim(study.pop)[1]%/%4000  
# find the jump size in the systematic sampling
# jump.size
rand.starting.pt=sample(2:jump.size,1) # find the random starting value
sampling.id = seq(rand.starting.pt, dim(study.pop)[1], jump.size)  # sampling IDs
#length(sampling.id)
sys.sample=study.pop[sampling.id,]    
# extract the sampling units of systematic samples
sys.Sample.dim = dim(sys.sample)
names(sys.Sample.dim) = c("Size", "Var.count")
kable(t(sys.Sample.dim))
```

Because the jump size involves rounding error and the population is large, the actual systematic sample size is slightly different from the target size. In this report, I used the integer part of the actual jump size. The actual systematic sampling size is slightly bigger than the target size. We can take away some records random from the systematic sample to make the size to be equal to the target size.

## Stratified Sampling

We take an SRS from each stratum. The sample size should be approximately proportional to the size of the corresponding stratum.

First, we calculate the SRS size for each stratum and then take the SRS from the corresponding stratum.

```{r}
freq.table = table(study.pop$NoEmpBin)  # frequency table of strNAICS
rel.freq = freq.table/sum(freq.table)   # relative frequency 
strata.size = round(rel.freq*4000)      # strata size allocation
strata.names=names(strata.size)         # extract strNAICS names for accuracy checking
```

```{r}
kable(t(strata.size))  # make a nice-looking table using kable().
```

In the following code chunk, we take stratified samples.

```{r}
strata.sample = study.pop[1,]    # create a reference data frame
strata.sample$add.id = 1   # add a temporary ID to because in the loop
                           # i =2 testing a single iteration
for (i in 1:length(strata.names)){
   ith.strata.names = strata.names[i]   # extract data frame names
   ith.strata.size = strata.size[i]     # allocated stratum size
   # The following code identifies observations to be selected
   ith.sampling.id = which(study.pop$NoEmpBin==ith.strata.names) 
   ith.strata = study.pop[ith.sampling.id,]  # i-th stratified population
   ith.strata$add.id = 1:dim(ith.strata)[1]  # add sampling list/frame
   # The following code generates a subset of random ID
   ith.sampling.id = sample(1:dim(ith.strata)[1], ith.strata.size) 
   ## Create a selection status -- pay attention to the operator: %in% 
   ith.sample =ith.strata[ith.strata$add.id %in%ith.sampling.id,]
   ## dim(ith.sample)         $ check the sample
   strata.sample = rbind(strata.sample, ith.sample)  # stack all data frame!
 }
 # dim(strata.sample)
 strat.sample.final = strata.sample[-1,]  # drop the temporary stratum ID
 #kable(head(strat.sample.final))         # accuracy check!
```

## Cluster Sampling

Here we will do a cluster sampling method based off of zip codes. We will cluster the data by zip codes and then take a random sample of zip codes. Our sample size is roughly around 4000.

```{r}
#unique zip codes
unizip <- unique(study.pop$Zip)

#random sample of zip codes
# checking the sampling list variable
sampled.zip = sample(unizip, 170) 

# join the data to the zip codes
cluster <- study.pop[study.pop$Zip %in% sampled.zip, ]

sys.clsuter = dim(cluster)
names(sys.clsuter) = c("Size", "Var.count")
kable(t(sys.clsuter))

```


