class: center, middle, inverse, title-slide

.title[
# Fiscal Year of Approval
]
.author[
### By: Evan Parker
]
.institute[
### West Chester University of Pennsylvania
]
.date[
### 10/22/2022
Prepared for
STA 490: Capstone Statistics
]

---

## <center><b><font color = purple>Data Set Description & Stratification Variable Choice</font></b></center>

The bank loan data set was provided by the U.S. Small Business Administration (SBA) and contains all historical loans endorsed by the SBA from 1987 through 2014.

- 899164 observations
- 27 variables

Stratification Variable: **ApprovalFY** - Fiscal Year of approval for the loan.

Below is a breakdown of the counts for each approval year.

<img src="data:image/png;base64,#https://raw.githubusercontent.com/EPKeep32/ParkerSTA490/main/Images/ApprovalFY%20Table%201.png" width="75%" style="display: block; margin: auto;" />

<img src="data:image/png;base64,#https://raw.githubusercontent.com/EPKeep32/ParkerSTA490/main/Images/ApprovalFY%20Table%202.png" width="75%" style="display: block; margin: auto;" />

<img src="data:image/png;base64,#https://raw.githubusercontent.com/EPKeep32/ParkerSTA490/main/Images/ApprovalFY%20Table%203.png" width="75%" style="display: block; margin: auto;" />

---

## <center><b><font color = purple>ApprovalFY Combination</font></b></center>

Because some categories contain very few observations, I combine all categories before *1990* into a new category named *1989*.

**Study Population now defined!**

<img src="data:image/png;base64,#https://raw.githubusercontent.com/EPKeep32/ParkerSTA490/main/Images/Combined%20ApprovalFY%20Table%201.png" width="75%" style="display: block; margin: auto;" />

<img src="data:image/png;base64,#https://raw.githubusercontent.com/EPKeep32/ParkerSTA490/main/Images/Combined%20ApprovalFY%20Table%202.png" width="75%" style="display: block; margin: auto;" />

- 899164 observations
- 29 variables (+2)
- Across 26 years (instead of 54)

---
class: inverse

## <center><b><font color = gold>Technique #1: Simple Random Sampling</font></b></center>

**Simple Random Sampling** (SRS) involves giving every observation a number from 1 through your population size `\(N\)`, then randomly picking `\(n\)` (your ideal sample size) of those numbers.
We must assume that every possible combination of `\(n\)` data points has an equal probability of being selected as the random sample.

**For this Study:**

- An SRS of 4000 observations is collected

---
class: inverse

## <center><b><font color = gold>Technique #2: Systematic Sampling</font></b></center>

**Systematic Sampling** utilizes a **jump size** `\(m\)`. You find your jump size by dividing your population size, `\(N\)`, by your sample size, `\(n\)`. That is, `\(m \approx N/n\)`. Then, you pick a random starting number between 1 and `\(m\)` and take every "m-th" observation after that. The systematic sampling plan is a valid choice because the starting point is chosen at random.

**For this Study:**

- Our `\(m\)` value is: 899164/4000 = 224.791 `\(\approx\)` 224
- Random Number chosen: 70
- Every 224-th observation is chosen: 70, 294, 518, 742, ...
- 4014 observations are collected

---
class: inverse

## <center><b><font color = gold>Technique #3: Stratified Sampling</font></b></center>

**Stratified Sampling** is an alternative to SRS and involves splitting up the population via a **stratification variable**. Then, we take a sample from each resulting group. An important point about stratified sampling as used here is that the sample drawn from each group should be proportional to that group's share of the population.

**For this study:**

- Stratification Variable: **ApprovalFY**
- Group by year, take a random sample from each group

---

## <center><b><font color = purple>Performance Analysis</font></b></center>

<img src="data:image/png;base64,#https://raw.githubusercontent.com/EPKeep32/ParkerSTA490/main/Images/Performance%20Analysis%20Table%20-%20Copy.png" style="display: block; margin: auto;" />

- The sample default rates for some years vary between the sampling methods.
- The default rates per sampling method are accurate during some years and are inaccurate during others.
- There are some years that are not represented in the samples due to randomness.
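
---

## <center><b><font color = purple>Sampling Plans: Code Sketch</font></b></center>

As a rough illustration of the three sampling plans described on the earlier slides, the Python sketch below draws each kind of sample and computes year-specific default rates. It is not the code used for this study: the record layout, the 0/1 `default` field, the toy population, and all function names are assumptions made for the example. Only the `ApprovalFY` name comes from the data set.

```python
import random
from collections import defaultdict


def simple_random_sample(records, n, rng):
    """SRS: every subset of n records is equally likely to be drawn."""
    return rng.sample(records, n)


def systematic_sample(records, n, rng, start=None):
    """Take every m-th record after a random start, with jump size m = N // n."""
    m = len(records) // n                  # jump size, rounded down
    if start is None:
        start = rng.randrange(1, m + 1)    # random 1-based start, 1 <= start <= m
    return records[start - 1::m]           # positions start, start + m, ...


def stratified_sample(records, n, key, rng):
    """Proportional allocation: each stratum gets about n * (stratum size / N) draws."""
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)
    sample = []
    for group in strata.values():
        k = round(n * len(group) / len(records))  # rounding can shift the total slightly
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample


def default_rate_by_year(records):
    """Share of defaulted loans within each ApprovalFY category."""
    totals = defaultdict(lambda: [0, 0])          # year -> [defaults, loans]
    for rec in records:
        totals[rec["ApprovalFY"]][0] += rec["default"]
        totals[rec["ApprovalFY"]][1] += 1
    return {year: d / t for year, (d, t) in totals.items()}


if __name__ == "__main__":
    rng = random.Random(490)  # fixed seed so the demo is repeatable
    # Synthetic toy population, NOT the SBA data: 3 years, made-up default flags.
    population = [{"ApprovalFY": 1989 + i // 3000, "default": int(rng.random() < 0.2)}
                  for i in range(9000)]
    samples = {
        "SRS": simple_random_sample(population, 300, rng),
        "Systematic": systematic_sample(population, 300, rng),
        "Stratified": stratified_sample(population, 300, "ApprovalFY", rng),
    }
    for name, sample in samples.items():
        print(name, len(sample), default_rate_by_year(sample))
```

With a population of 899164 records, a target of 4000, and a start of 70, `systematic_sample` picks positions 70, 294, 518, 742, ... and returns 4014 records, which matches the figures on the systematic sampling slide. Because the stratum sizes are rounded, `stratified_sample` may return a few more or fewer than `n` records.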
---

## <center><b><font color = purple>Performance Analysis Graph</font></b></center>

<img src="data:image/png;base64,#https://raw.githubusercontent.com/EPKeep32/ParkerSTA490/main/Images/ApprovalFY%20Graph.png" style="display: block; margin: auto;" />

The patterns of year-specific default rates are also reflected in the above line plot.

---

## <center><b><font color = purple>Mean Square Error Graph</font></b></center>

<img src="data:image/png;base64,#https://raw.githubusercontent.com/EPKeep32/ParkerSTA490/main/Images/MSE%20Graph.png" style="display: block; margin: auto;" />

To compare the overall performance of the three sampling plans based on the single-step samples, I look at the mean square errors of the differences in the default rates between the population and each of the three random samples. The results are summarized in the graph above, which shows that Stratified and SRS significantly outperform the systematic sampling plan.

**Important Note:** The pattern observed above in the discrepancy between population and sample rates could change significantly from one set of samples to another.

---
class: inverse

## <center><b><font color = gold>Conclusion</font></b></center>

**General key takeaways:**

- **Simple Random Sampling** is when you assign each observation a number, randomly choose numbers, and sample the observations associated with those numbers
- **Systematic Sampling** is when you calculate a jump size `\(m\)` and take every "m-th" observation after a randomly chosen starting point
- **Stratified Sampling** is when you split the population into groups based on a stratification variable and then sample from each group

**Takeaways from the Study:**

- Combining categories of **ApprovalFY** helped reduce sampling errors
- **Stratified Sampling** worked best
- Might differ if different samples are chosen
- Interactions between variables could alter outcomes

### <center><b><font color = gold>Any Questions?</font></b></center>
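
---

## <center><b><font color = purple>Backup: Mean Square Error Sketch</font></b></center>

The mean square error comparison described on the Mean Square Error Graph slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the calculation used in the study: the function name is hypothetical, the rates in the demo are made up, and skipping years that are missing from a sample is an assumed way to handle the unrepresented years noted earlier.

```python
def mean_square_error(population_rates, sample_rates):
    """Mean squared difference in default rate over years found in both inputs."""
    # Assumption: years absent from the sample are skipped rather than penalized.
    shared = [year for year in population_rates if year in sample_rates]
    if not shared:
        raise ValueError("no year appears in both the population and the sample")
    squared_diffs = [(sample_rates[y] - population_rates[y]) ** 2 for y in shared]
    return sum(squared_diffs) / len(shared)


if __name__ == "__main__":
    # Illustrative made-up rates, NOT results from the study.
    population_rates = {1989: 0.20, 1990: 0.10, 1991: 0.15}
    sample_rates = {1989: 0.25, 1990: 0.10}   # 1991 is missing from this sample
    print(mean_square_error(population_rates, sample_rates))
```

Each argument maps an approval year to a default rate, so one call per sampling plan gives the three values being compared. A smaller value means the sample's year-specific default rates sit closer to the population's.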