Combined Data
loan01 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
# dim(bankLoan)
#names(bankLoan)
Mis_Status Missing
loan <- loan[!is.na(loan$MIS_Status), ]
Mis_Status is the status of the persons loan. This is one of the most
important variables in our dataset. It ensures us that when we do our
analysis we have a complete observation telling us if the loan was
approved, paid, or defaulted. So we remove any observations with loan
status missing for clarity.
Convert Variables
currency_vars <- c("DisbursementGross", "BalanceGross", "ChgOffPrinGr", "GrAppv", "SBA_Appv")
loan[currency_vars] <- lapply(loan[currency_vars], function(x) as.numeric(gsub("[$,]", "", x)))
Since DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, and
SBA_Appv are important measures for our analysis we want to be able to
use them more methodically. So we convert these categorical variables to
numeric variables and remove any character values. Working with numbers
as opposed to charterers is better for statistical analysis,
visualization, and overall modeling. ## Categorical Variable rework
loan <- loan %>%
mutate(BankRegion = case_when(
BankState %in% c("CT", "ME", "MA", "NH", "NJ", "NY", "PA", "RI", "VT") ~ "Northeast",
BankState %in% c("IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI") ~ "Midwest",
BankState %in% c("AL", "AR", "DE", "DC", "FL", "GA", "KY", "LA", "MD", "MS", "NC", "OK", "SC", "TN", "TX", "VA", "WV") ~ "South",
BankState %in% c("AK", "AZ", "CA", "CO", "HI", "ID", "MT", "NV", "NM", "OR", "UT", "WA", "WY") ~ "West",
TRUE ~ "Unknown"
))
table(loan$BankRegion)
Midwest Northeast South Unknown West
269885 150560 275751 1730 201238
I decided to make 4 regional categories of the states. These
categories being Northeast, Midwest, South, West and also an unknown. By
doing this we can see if there are any patterns for any specific region.
We can use this to compare approval rates, default rates, or average
loan amounts by region. This also makes it easier for data visualization
to again see if there is any disparities by region.
Default Rates based on
Region
default_rates <- loan %>%
group_by(BankRegion) %>%
summarise(
Total_Loans = n(),
Defaults = sum(MIS_Status == "CHGOFF", na.rm = TRUE),
Default_Rate = Defaults / Total_Loans * 100
)
print(default_rates)
# A tibble: 5 × 4
BankRegion Total_Loans Defaults Default_Rate
<chr> <int> <int> <dbl>
1 Midwest 269885 42537 15.8
2 Northeast 150560 21021 14.0
3 South 275751 58751 21.3
4 Unknown 1730 106 6.13
5 West 201238 35143 17.5
This code calculates the default rate of SBA-backed loans for each
region. With this code it helps us to compare which regions have higher
or lower default rates. It also gives insight into regional economic
trends and lending practices. We see that in the South the default rate
is the highest at 21% meaning that more loans go unpaid here than in any
other region.
Discretize GrApprv
loan <- loan %>%
mutate(GrAppv_Cat = cut(GrAppv,
breaks = quantile(GrAppv, probs = seq(0, 1, by = 0.2), na.rm = TRUE),
include.lowest = TRUE,
labels = c("Very Low", "Low", "Medium", "High", "Very High")))
table(loan$GrAppv_Cat)
Very Low Low Medium High Very High
180997 179102 179575 179677 179813
In the code above we categorize the SBA guaranteed approval amount
into five groups. By categorizing and making this variable into small
intervals we convert it from a continuous variable we now have a
discretized variable. Discretized variables are easier to interpret. We
see from the output above that they are evenly distributed meaning our
approval spans a wide range of amounts.
Study Population
study.pop <- loan %>%
filter(BankRegion != "Unknown")
kable(t(table(study.pop$BankRegion)))
269885 |
150560 |
275751 |
201238 |
For the study population we remove the unknown group because there is
only 1730 observations which is too small and does not match up with our
other categories. We also cannot draw conclusions about the region if we
do not know where these observations are coming from. So now we have
defined our study population.
Sample Calculation
std_dev <- sd(loan$SBA_Appv, na.rm = TRUE)
Z_alpha <- 1.96
Z_beta <- 0.84
delta <- 5000
n <- (2 * (Z_alpha + Z_beta)^2 * std_dev^2) / delta^2
n_rounded <- ceiling(n)
cat("The required sample size per group is:", n_rounded, "\n")
The required sample size per group is: 32724
After making a calculation for our sample we get a sample size of
32724 which is exactly where we need the size to be. It is less than 5%
of our dataset but is also a relatively big number.
Simple Random
Sample
study.pop$sampling.frame = 1:length(study.pop$GrAppv)
sampled.list = sample(1:length(study.pop$GrAppv), 32724)
SRS.sample = study.pop[sampled.list,]
dimension.SRS = dim(SRS.sample)
names(dimension.SRS) = c("Size", "Var.count")
kable(t(dimension.SRS))
We performed a simple random sample. First we assigned a unique index
to each observation using sampling.frame. Then we randomly selected
32,724 observations from the dataset without replacement. Finally we
extract the sampled observations into SRS.sample to make a table.
Systematic
Sampling
jump.size = dim(study.pop)[1]%/% 32724
rand.starting.pt=sample(1:jump.size,1)
sampling.id = seq(rand.starting.pt, dim(study.pop)[1], jump.size)
sys.sample=study.pop[sampling.id,]
sys.Sample.dim = dim(sys.sample)
names(sys.Sample.dim) = c("Size", "Var.count")
kable(t(sys.Sample.dim))
We performed a systematic sample. First we calculated the jump size
by dividing the total number of observations by 32,724, determining the
interval at which samples will be selected. Then we randomly picked a
starting point within the first interval. Using this starting point we
generate a sequence of indices at regular intervals to select
observations.
Stratefied
Sampling
freq.table = table(study.pop$BankRegion)
rel.freq = freq.table/sum(freq.table)
strata.size = round(rel.freq*32724)
strata.names=names(strata.size)
kable(t(strata.size))
We performed a Stratified Sampling by first creating a frequency
table of the BankRegion variable, which represents different categories.
Then we calculate the relative frequency of each category by dividing
the counts by the total number of observations. Next we determine the
sample size for each categories by multiplying the relative frequencies
by 32,724 and rounding the values.
strata.sample = study.pop[0,]
for (i in 1:length(strata.names)) {
ith.strata.names = strata.names[i]
ith.strata.size = strata.size[i]
ith.sampling.id = which(study.pop$BankRegion == ith.strata.names)
if (length(ith.sampling.id) > 0 && ith.strata.size > 0) {
ith.strata = study.pop[ith.sampling.id,]
ith.strata$add.id = 1:nrow(ith.strata)
ith.sampling.id = sample(1:nrow(ith.strata), min(ith.strata.size, nrow(ith.strata)))
ith.sample = ith.strata[ith.sampling.id, ]
strata.sample = rbind(strata.sample, ith.sample)
}
}
strat.sample.final = strata.sample
kable(head(strat.sample.final))
545534 |
5363724001 |
GOOD FEET STORE |
DENVER |
CO |
80231 |
WELLS FARGO BANK NATL ASSOC |
MN |
448210 |
26-Apr-02 |
2002 |
72 |
6 |
1 |
4 |
6 |
1 |
1 |
0 |
N |
|
30-Jun-02 |
130000 |
0 |
P I F |
0 |
130000 |
110500 |
Midwest |
High |
545534 |
168642 |
838741 |
9008854009 |
JERRE ALLYN LLC |
DENVER |
CO |
80202 |
WELLS FARGO BANK NATL ASSOC |
SD |
522310 |
27-Jul-05 |
2005 |
12 |
2 |
1 |
2 |
2 |
1 |
1 |
Y |
N |
6-Feb-10 |
31-Aug-05 |
60103 |
0 |
CHGOFF |
25596 |
30000 |
15000 |
Midwest |
Low |
838741 |
250316 |
699908 |
7286233005 |
SUNCREST EXTERMINATING, INC. |
ANAHEIM |
CA |
92806 |
WELLS FARGO BANK NATL ASSOC |
SD |
0 |
13-Jul-94 |
1994 |
300 |
32 |
1 |
0 |
0 |
1 |
0 |
N |
N |
|
31-Oct-94 |
440000 |
0 |
P I F |
0 |
440000 |
308000 |
Midwest |
Very High |
699908 |
208883 |
815096 |
8754793005 |
BOAT LIFT MARINE CENTER |
OSAGE BEACH |
MO |
65065 |
CENTRAL BK OF LAKE OF OZARKS |
MO |
333923 |
31-Oct-95 |
1996 |
120 |
6 |
1 |
0 |
0 |
1 |
0 |
N |
N |
|
31-Jan-96 |
285000 |
0 |
P I F |
0 |
285000 |
213750 |
Midwest |
High |
815096 |
242436 |
812631 |
8727094008 |
KING’S UPHOLSTERY |
JOELTON |
TN |
37080 |
U.S. BANK NATIONAL ASSOCIATION |
OH |
811420 |
23-Mar-05 |
2005 |
84 |
3 |
1 |
0 |
3 |
1 |
1 |
Y |
N |
|
30-Apr-05 |
24584 |
0 |
P I F |
0 |
10000 |
5000 |
Midwest |
Very Low |
812631 |
241609 |
298982 |
3069575000 |
DAVIDOVICH BAGEL & LOX FACTORY |
JAMAICA |
NY |
11435 |
JPMORGAN CHASE BANK NATL ASSOC |
IL |
311812 |
29-Apr-08 |
2008 |
84 |
9 |
2 |
1 |
9 |
0 |
1 |
N |
N |
|
31-May-08 |
101300 |
0 |
P I F |
0 |
101300 |
50650 |
Midwest |
Medium |
298982 |
88047 |
Cluster Sampling
unique_zip_codes = unique(loan$Zip)
avg_loans_per_zip = mean(table(loan$Zip))
num_zip_codes = ceiling(32724 / avg_loans_per_zip)
set.seed(123)
sampled_zip_codes = sample(unique_zip_codes, size = num_zip_codes, replace = FALSE)
cluster_sample = loan[loan$Zip %in% sampled_zip_codes, ]
cluster_sample_size = nrow(cluster_sample)
cluster_sample_dim = dim(cluster_sample)
names(cluster_sample_dim) = c("Size", "Var.count")
kable(t(cluster_sample_dim))
cluster_sample_size
[1] 32305
For cluster sampling we select a random set of zip codes and includes
all loans from those zip codes to create a cluster sample. We first
calculate the average number of loans per zip code and determines how
many zip codes are needed to reach the target sample size of 32,724.
Finally we filter the dataset to keep only loans from the selected zip
codes and check the final sample size.
SBA_Appr Curves
ggplot(loan, aes(x = SBA_Appv, fill = GrAppv_Cat, color = GrAppv_Cat)) +
geom_density(alpha = 0.3) +
labs(title = "Density Curves of SBA Approval Amount by Loan Size",
x = "SBA Approval Amount (SBA_Appv)",
y = "Density") +
theme_minimal() +
theme(legend.title = element_blank()) +
coord_cartesian(xlim = c(0, 500000))

The code above shows a density plot of the distribution of SBA
approval amounts, categorized by loan size. From the visualization above
we can see which loan sizes are most common and helps us see if certain
loan categories are higher. The density peaks tell us where most SBA
approval amounts fall.
---
title: "EDA Bank Loan"
author: 'Tyler Battaglini'
date: "2025-3-2"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 4
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
always_allow_html: true
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-weight: bold;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-weight: bold;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-weight: bold;
    font-family: system-ui;
    color: navy;
    text-align: left;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
# Detect, install and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("nleqslv")) {
   install.packages("nleqslv")
   library(nleqslv)
}
#
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}

if (!require("psych")) {   
  install.packages("psych")
   library(psych)
}
if (!require("MASS")) {   
  install.packages("MASS")
   library(MASS)
}
if (!require("ggplot2")) {   
  install.packages("ggplot2")
   library(ggplot2)
}
if (!require("GGally")) {   
  install.packages("GGally")
   library(GGally)
}
if (!require("car")) {   
  install.packages("car")
   library(car)
}
if (!require("dplyr")) {   
  install.packages("dplyr")
   library(dplyr)
}
if (!require("caret")) {   
  install.packages("caret")
   library(caret)
}
if (!require("readxl")) {   
  install.packages("readxl")
   library(readxl)
}
if (!require("openxlsx")) {   
  install.packages("openxlsx")
   library(openxlsx)
}
if (!require("forecast")) {   
  install.packages("forecast")
   library(forecast)
}
if (!require("pwr")) {   
  install.packages("pwr")
   library(pwr)
}
# specifications of outputs of code in code chunks
knitr::opts_chunk$set(echo = TRUE,      # include code chunk in the output file
                      warnings = FALSE,  # sometimes, you code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      messages = FALSE,  #
                      results = TRUE,
                      
                      comment = NA       # you can also decide whether to include the output
                                         # in the output file.
                      )   
```


## Combined Data

```{r}
loan01 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
# dim(bankLoan)
#names(bankLoan)



```

## Mis_Status Missing

```{r}
loan <- loan[!is.na(loan$MIS_Status), ]

```

Mis_Status is the status of the persons loan. This is one of the most important variables in our dataset. It ensures us that when we do our analysis we have a complete observation telling us if the loan was approved, paid, or defaulted. So we remove any observations with loan status missing for clarity.

## Convert Variables

```{r}
currency_vars <- c("DisbursementGross", "BalanceGross", "ChgOffPrinGr", "GrAppv", "SBA_Appv")

loan[currency_vars] <- lapply(loan[currency_vars], function(x) as.numeric(gsub("[$,]", "", x)))

```

Since DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, and SBA_Appv are important measures for our analysis we want to be able to use them more methodically. So we convert these categorical variables to numeric variables and remove any character values. Working with numbers as opposed to charterers is better for statistical analysis, visualization, and overall modeling.
## Categorical Variable rework

```{r}
loan <- loan %>%
  mutate(BankRegion = case_when(
    BankState %in% c("CT", "ME", "MA", "NH", "NJ", "NY", "PA", "RI", "VT") ~ "Northeast",
    BankState %in% c("IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI") ~ "Midwest",
    BankState %in% c("AL", "AR", "DE", "DC", "FL", "GA", "KY", "LA", "MD", "MS", "NC", "OK", "SC", "TN", "TX", "VA", "WV") ~ "South",
    BankState %in% c("AK", "AZ", "CA", "CO", "HI", "ID", "MT", "NV", "NM", "OR", "UT", "WA", "WY") ~ "West",
    TRUE ~ "Unknown"
  ))

table(loan$BankRegion)
```

I decided to make 4 regional categories of the states. These categories being Northeast, Midwest, South, West and also an unknown. By doing this we can see if there are any patterns for any specific region. We can use this to compare approval rates, default rates, or average loan amounts by region. This also makes it easier for data visualization to again see if there is any disparities by region. 

## Default Rates based on Region

```{r}
default_rates <- loan %>%
  group_by(BankRegion) %>%
  summarise(
    Total_Loans = n(),
    Defaults = sum(MIS_Status == "CHGOFF", na.rm = TRUE),
    Default_Rate = Defaults / Total_Loans * 100
  )

print(default_rates)
```

This code calculates the default rate of SBA-backed loans for each region. With this code it helps us to compare which regions have higher or lower default rates. It also gives insight into regional economic trends and lending practices. We see that in the South the default rate is the highest at 21% meaning that more loans go unpaid here than in any other region. 

## Discretize GrApprv

```{r}
loan <- loan %>%
  mutate(GrAppv_Cat = cut(GrAppv, 
                          breaks = quantile(GrAppv, probs = seq(0, 1, by = 0.2), na.rm = TRUE),
                          include.lowest = TRUE,
                          labels = c("Very Low", "Low", "Medium", "High", "Very High")))

table(loan$GrAppv_Cat)
```

In the code above we categorize the SBA guaranteed approval amount into five groups. By categorizing and making this variable into small intervals we convert it from a continuous variable we now have a discretized variable. Discretized variables are easier to interpret. We see from the output above that they are evenly distributed meaning our approval spans a wide range of amounts. 

## Study Population

```{r}

study.pop <- loan %>%
  filter(BankRegion != "Unknown")

kable(t(table(study.pop$BankRegion)))

```

For the study population we remove the unknown group because there is only 1730 observations which is too small and does not match up with our other categories. We also cannot draw conclusions about the region if we do not know where these observations are coming from. So now we have defined our study population.

## Sample Calculation

```{r}
std_dev <- sd(loan$SBA_Appv, na.rm = TRUE)

Z_alpha <- 1.96  
Z_beta <- 0.84   
delta <- 5000    

n <- (2 * (Z_alpha + Z_beta)^2 * std_dev^2) / delta^2

n_rounded <- ceiling(n)

cat("The required sample size per group is:", n_rounded, "\n")

```

After making a calculation for our sample we get a sample size of 32724 which is exactly where we need the size to be. It is less than 5% of our dataset but is also a relatively big number.

## Simple Random Sample

```{r}
study.pop$sampling.frame = 1:length(study.pop$GrAppv)   

sampled.list = sample(1:length(study.pop$GrAppv), 32724) 

SRS.sample = study.pop[sampled.list,]                  

dimension.SRS = dim(SRS.sample)
names(dimension.SRS) = c("Size", "Var.count")
kable(t(dimension.SRS))

```

We performed a simple random sample. First we assigned a unique index to each observation using sampling.frame. Then we randomly selected 32,724 observations from the dataset without replacement. Finally we  extract the sampled observations into SRS.sample to make a table. 

## Systematic Sampling

```{r}
jump.size = dim(study.pop)[1]%/% 32724

rand.starting.pt=sample(1:jump.size,1)
sampling.id = seq(rand.starting.pt, dim(study.pop)[1], jump.size)

sys.sample=study.pop[sampling.id,]    

sys.Sample.dim = dim(sys.sample)
names(sys.Sample.dim) = c("Size", "Var.count")
kable(t(sys.Sample.dim))

```

We performed a systematic sample. First we calculated the jump size by dividing the total number of observations by 32,724, determining the interval at which samples will be selected. Then we randomly picked a starting point within the first interval. Using this starting point we generate a sequence of indices at regular intervals to select observations.



## Stratefied Sampling

```{r}
freq.table = table(study.pop$BankRegion)  
rel.freq = freq.table/sum(freq.table)  
strata.size = round(rel.freq*32724)     
strata.names=names(strata.size)        
kable(t(strata.size)) 

```

We performed a Stratified Sampling by first creating a frequency table of the BankRegion variable, which represents different categories. Then we calculate the relative frequency of each category by dividing the counts by the total number of observations. Next we determine the sample size for each categories by multiplying the relative frequencies by 32,724 and rounding the values.

```{r}
strata.sample = study.pop[0,]  

for (i in 1:length(strata.names)) {
   ith.strata.names = strata.names[i]  
   ith.strata.size = strata.size[i]   

   ith.sampling.id = which(study.pop$BankRegion == ith.strata.names) 
   
   if (length(ith.sampling.id) > 0 && ith.strata.size > 0) {
       ith.strata = study.pop[ith.sampling.id,] 
       ith.strata$add.id = 1:nrow(ith.strata)  
       
       ith.sampling.id = sample(1:nrow(ith.strata), min(ith.strata.size, nrow(ith.strata))) 
       ith.sample = ith.strata[ith.sampling.id, ]
       
       strata.sample = rbind(strata.sample, ith.sample) 
   }
}

strat.sample.final = strata.sample

kable(head(strat.sample.final))  

```

## Cluster Sampling

```{r}
unique_zip_codes = unique(loan$Zip)

avg_loans_per_zip = mean(table(loan$Zip))

num_zip_codes = ceiling(32724 / avg_loans_per_zip)

set.seed(123)
sampled_zip_codes = sample(unique_zip_codes, size = num_zip_codes, replace = FALSE)

cluster_sample = loan[loan$Zip %in% sampled_zip_codes, ]

cluster_sample_size = nrow(cluster_sample)
cluster_sample_dim = dim(cluster_sample)
names(cluster_sample_dim) = c("Size", "Var.count")

kable(t(cluster_sample_dim))

cluster_sample_size
```

For cluster sampling we select a random set of zip codes and includes all loans from those zip codes to create a cluster sample. We first calculate the average number of loans per zip code and determines how many zip codes are needed to reach the target sample size of 32,724. Finally we filter the dataset to keep only loans from the selected zip codes and check the final sample size.


## SBA_Appr Curves

```{r}
ggplot(loan, aes(x = SBA_Appv, fill = GrAppv_Cat, color = GrAppv_Cat)) +
  geom_density(alpha = 0.3) + 
  labs(title = "Density Curves of SBA Approval Amount by Loan Size",
       x = "SBA Approval Amount (SBA_Appv)", 
       y = "Density") +
  theme_minimal() +
  theme(legend.title = element_blank()) +  
  coord_cartesian(xlim = c(0, 500000))
```

The code above shows a density plot of the distribution of SBA approval amounts, categorized by loan size. From the visualization above we can see which loan sizes are most common and helps us see if certain loan categories are higher. The density peaks tell us where most SBA approval amounts fall. 





