
.title[
# Fiscal Year of Approval
]
.author[
### By: Evan Parker 
]
.institute[
### West Chester University of Pennsylvania 
]
.date[
### 10/22/2022 Prepared for STA 490: Capstone Statistics 
]

---

## <center>Data Set Description & Stratification Variable Choice</center>

The bank loan data set, provided by the U.S. Small Business Administration (SBA), contains all historical loans endorsed by the SBA from 1987 through 2014.

- 899164 observations
- 27 variables

Stratification Variable: **ApprovalFY** -  Fiscal Year of approval for the loan.

Below is a breakdown of the counts for each approval year.

---

## <center>ApprovalFY Combination</center>

Because some categories contain very few observations, I combine all categories before *1990* into a new category named *1989*.

**Study Population now defined!**

- 899164 observations
- 29 variables (+2)
- Across 26 years (instead of 54)
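
The recoding step can be sketched in Python as follows. This is a minimal illustration, assuming the approval years are stored as integers; the function name `collapse_year` and the sample values are hypothetical and not taken from the SBA data.

```python
def collapse_year(year, cutoff=1990, label=1989):
    """Map any approval year before `cutoff` to the single category `label`."""
    return label if year < cutoff else year

# Hypothetical approval years for illustration only (not from the SBA data).
approval_fy = [1974, 1985, 1989, 1990, 2003, 2014]
combined_fy = [collapse_year(y) for y in approval_fy]
print(combined_fy)  # [1989, 1989, 1989, 1990, 2003, 2014]
```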

---

## <center>Technique #1: Simple Random Sampling</center>

**Simple Random Sampling** (SRS) involves giving every observation a number from 1 through your population size `\(N\)`, then randomly picking `\(n\)` (your ideal sample size) of those numbers. The defining property is that every possible combination of `\(n\)` data points has the same probability of being selected as the sample.

**For this Study:**

- An SRS of 4000 observations is collected
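
A minimal Python sketch of this draw, assuming observations are numbered 1 through `\(N\)`. The function name `simple_random_sample` and the seed are hypothetical choices for illustration, not the code used in the study.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw `sample_size` distinct observation numbers from 1..population_size."""
    rng = random.Random(seed)
    # sample() draws without replacement, so every subset of the
    # requested size is equally likely to be chosen.
    return rng.sample(range(1, population_size + 1), sample_size)

# N and n are the values stated on this slide; the seed is arbitrary.
srs_ids = simple_random_sample(899164, 4000, seed=490)
print(len(srs_ids), len(set(srs_ids)))  # 4000 4000
```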

---

## <center>Technique #2: Systematic Sampling</center>

**Systematic Sampling** uses a **jump size** `\(m\)`, found by dividing the population size `\(N\)` by the sample size `\(n\)`. That is, `\(m \approx N/n\)`. You pick a random starting point no larger than `\(m\)`, then take every `\(m\)`-th observation after it. The systematic sampling plan is a valid choice because the starting point is chosen at random.

**For this Study:**

- Our `\(m\)` value is: 899164/4000 = 224.791, rounded down to 224
- Random Number chosen: 70
- Every 224-th observation is chosen: 70, 294, 518, 742, ...
- 4014 observations are collected
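
The arithmetic above can be checked with a short Python sketch. It assumes observations are numbered 1 through `\(N\)` with a target sample size of 4000; the function name `systematic_sample` is hypothetical, and the start is fixed at 70 to match the slide rather than drawn at random.

```python
def systematic_sample(population_size, jump, start):
    """Return the observation numbers start, start + jump, ... up to population_size."""
    return list(range(start, population_size + 1, jump))

N, n = 899164, 4000
m = N // n  # integer division rounds 224.791 down to 224
# In practice the start would be drawn at random from 1..m;
# 70 is used here because it is the value reported in the study.
sys_ids = systematic_sample(N, m, start=70)
print(m, sys_ids[:4], len(sys_ids))  # 224 [70, 294, 518, 742] 4014
```

Rounding the jump size down is why the sample ends up slightly larger than the 4000 target.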

---

## <center>Technique #3: Stratified Sampling</center>

**Stratified Sampling** is an alternative to SRS that splits the population into groups (strata) via a **stratification variable**, then takes a random sample from each group. With the proportional design used here, each group's share of the sample must match its share of the population.

**For this study:**

- Stratification Variable: **ApprovalFY**
    - Group by year, take random sample
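
A minimal Python sketch of proportional stratified sampling, under the assumption that the same sampling fraction is applied within every stratum. The function name `stratified_sample`, the toy records, and the 10% fraction are hypothetical and only illustrate the idea.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=None):
    """Sample the same fraction, without replacement, from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for label in sorted(strata):
        group = strata[label]
        k = round(len(group) * fraction)  # stratum size times the common fraction
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical toy population of (year, id) pairs: 200 loans in 1989, 800 in 1990.
records = [(1989, i) for i in range(200)] + [(1990, i) for i in range(800)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1, seed=490)
counts = {year: sum(1 for r in sample if r[0] == year) for year in (1989, 1990)}
print(counts)  # {1989: 20, 1990: 80}
```

The 20:80 split in the sample mirrors the 200:800 split in the toy population, which is what proportionality means here.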

---

## <center>Performance Analysis</center>

- The sample default rates for some years vary between the sampling methods.

- The default rates per sampling method are accurate during some years and are inaccurate during others.

- There are some years that are not represented in the samples due to randomness.

---

## <center>Performance Analysis Graph</center>

The patterns of year-specific default rates are also reflected in the above line plot.

---

## <center>Mean Square Error Graph</center>

To compare the overall performance of the three sampling plans based on the single-step samples, I look at the mean square error of the differences between the population default rates and the default rates in each of the three random samples. The results are summarized in the graph above, which shows that the Stratified and SRS plans significantly outperform the systematic sampling plan.

**Important Note:** The discrepancies observed above between the population and sample rates could change significantly from one set of samples to another.
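
One way to compute such a mean square error is sketched below in Python. This is an assumption about the calculation, not the study's actual code: the function name `mean_square_error` and the rates are hypothetical, and years missing from a sample are simply skipped.

```python
def mean_square_error(population_rates, sample_rates):
    """Average squared gap between sample and population default rates by year."""
    # Assumption: only years present in both dictionaries are compared.
    years = [y for y in population_rates if y in sample_rates]
    squared_gaps = [(sample_rates[y] - population_rates[y]) ** 2 for y in years]
    return sum(squared_gaps) / len(years)

# Hypothetical default rates by year, for illustration only.
population_rates = {1989: 0.10, 1990: 0.20, 1991: 0.30}
sample_rates = {1989: 0.12, 1990: 0.17, 1991: 0.30}
mse = mean_square_error(population_rates, sample_rates)
print(round(mse, 6))
```

A smaller value means the sample's year-specific default rates track the population's more closely.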

---

## <center>Conclusion</center>

**General key takeaways:**

- **Simple Random Sampling** is when you assign each observation a number, randomly choose numbers, and sample the observations associated with those numbers
- **Systematic Sampling** is when you calculate a jump size `\(m\)` and take every `\(m\)`-th observation after a randomly chosen starting point
- **Stratified Sampling** is when you split the population into groups based on a stratification variable and then sample from each resulting group

**Takeaways from the Study:**

- Combining categories of **ApprovalFY** helped reduce sampling errors 
- **Stratified Sampling** worked best
    - Might differ if different samples are chosen
- Interactions between variables could alter outcomes

### <center>Any Questions?</center>