Sampling Plans and Analysis for Loan Data

class: center, middle, inverse, title-slide

.title[
# <font size="7" color="White">Sampling Plans and Analysis for Loan Data </font>
]
.author[
### <font size="5" color="White"> Jaiden Neff </font>
]
.institute[
### <font size="6" color="White">West Chester University of Pennsylvania</font><br>
]

---

<h2 align="center"> Table of Contents</h2>
<BR>

.pull-left[
- Data Description 
- Data Cleaning
   - Description of FRR 
- Sampling Description
- Loan Default Rate 
- Study Population 
- Simple Random Sample  
- Systematic Sample
- Stratified Sample
- Cluster Sample
- Default Rate 
- Plot/Visual Comparison
- Final Thoughts
]

---

<h2 align="center"> Data Description </h2>

![](datainfo1.png)

---
<h2 align="center"> Data Description </h2>

![](datainfo2.png) 
---

<h2 align="center"> Data Cleaning </h2>

- Removed all records with a missing MIS_Status
- Converted the dollar amounts to numeric variables
- Wanted to make the bank state information more useful
- Grouped all of the bank states into Federal Reserve Regions (FRR)
- The grouping lets us look at 12 distinct categories instead of 50
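
The first two cleaning steps could look roughly like this in base R. This is a minimal sketch on an invented three-row data frame: the raw column name `DisbursementGross` and its `$60,000.00` text format are assumptions, while `MIS_Status` and `nDisbursementGross` appear in the data set.

```r
# Toy stand-in for the raw loan data (values are invented for illustration).
loan_raw <- data.frame(
  MIS_Status        = c("P I F", NA, "CHGOFF"),
  DisbursementGross = c("$60,000.00 ", "$40,000.00 ", "$287,000.00 "),
  stringsAsFactors  = FALSE
)

# Step 1: drop rows whose loan status is missing or empty.
keep <- !is.na(loan_raw$MIS_Status) & loan_raw$MIS_Status != ""
loan_clean <- loan_raw[keep, ]

# Step 2: strip "$", "," and spaces, then convert the text to numeric.
loan_clean$nDisbursementGross <- as.numeric(
  gsub("[$, ]", "", loan_clean$DisbursementGross)
)

loan_clean$nDisbursementGross
```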

---

<h2 align="center"> Federal Reserve Regions </h2>

Boston: Connecticut (CT), Maine (ME), Massachusetts (MA), New Hampshire (NH), Rhode Island (RI), Vermont (VT)

New York: New Jersey (NJ), New York (NY)

Philadelphia: Pennsylvania (PA)

Richmond: Delaware (DE), District of Columbia (DC), Maryland (MD), Virginia (VA), West Virginia (WV)

Atlanta: Alabama (AL), Florida (FL), Georgia (GA), North Carolina (NC), South Carolina (SC)

St. Louis: Kentucky (KY), Louisiana (LA), Mississippi (MS), Tennessee (TN)

Chicago: Illinois (IL), Indiana (IN), Iowa (IA), Michigan (MI), Minnesota (MN), Wisconsin (WI)

Dallas: Arkansas (AR), Missouri (MO), Oklahoma (OK), Texas (TX)

Kansas City: Colorado (CO), Montana (MT), Utah (UT), Wyoming (WY), Ohio (OH), North Dakota (ND), South Dakota (SD), Nebraska (NE), Kansas (KS)

San Francisco: Arizona (AZ), California (CA), Hawaii (HI), Nevada (NV)

Seattle: Alaska (AK), Idaho (ID), Oregon (OR), Washington (WA)

Other: States not listed in the above breakdown
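
One way to apply a mapping like the one above is a named lookup vector. The sketch below covers only a handful of states, and the vector names `region_lookup` and `bank_state` and the unlisted code `"XX"` are hypothetical.

```r
# Named lookup vector: state abbreviation -> region (only a few states shown).
region_lookup <- c(
  CT = "Boston", MA = "Boston",
  NJ = "New York", NY = "New York",
  PA = "Philadelphia",
  IL = "Chicago", IN = "Chicago",
  OK = "Dallas", TX = "Dallas"
)

bank_state <- c("IN", "OK", "PA", "XX")   # "XX" is not in the lookup

# Index by name; states missing from the lookup come back as NA.
region <- unname(region_lookup[bank_state])

# Anything not listed falls into the "Other" category.
region[is.na(region)] <- "Other"

region
```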

---

<h2 align="center"> Sampling Type Description </h2>

#### Simple Random Sampling (SRS):

- Equal chance of selection
- Selections are independent of one another

Why use it: 
- Ensures every unit has the same chance of selection 
- Avoids selection bias, so the sample is more representative of the population as a whole

#### Systematic Sampling:

- Involves selecting every k-th element from a list or sampling frame
- (k) is determined based on the desired sample size and the size of the population
 
Why use it:
- Simple to carry out, and it spreads the sample evenly across the population
- Useful when there is a list or sequence of elements to sample from, such as customer lists or patient records
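
The two methods above can be sketched in a few lines of base R. The population size, sample size, and seed are arbitrary assumptions chosen for illustration, not the values used in the study.

```r
set.seed(123)            # arbitrary seed so the draw is reproducible
N <- 1000                # toy population size (assumption)
n <- 50                  # desired sample size (assumption)
frame <- seq_len(N)      # sampling frame: row positions 1..N

# Simple random sample: n rows drawn without replacement.
srs_idx <- sample(frame, size = n, replace = FALSE)

# Systematic sample: random start in 1..k, then every k-th row.
k <- floor(N / n)
start <- sample(seq_len(k), size = 1)
sys_idx <- seq(from = start, to = N, by = k)

length(srs_idx)
length(sys_idx)
```

The systematic indices are evenly spaced by `k`, while the simple random indices can land anywhere in the frame.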

---

<h2 align="center"> Sampling Type Description </h2>

#### Stratified Sampling:

- The population is divided into distinct subgroups or strata based on certain characteristics
- Samples are then independently drawn from each stratum

Why use it: 
- Ensures that each subgroup is proportionally represented in the sample 
- Improves the precision of estimates, especially when certain subgroups would otherwise be underrepresented

#### Cluster Sampling:

- The population is divided into clusters or groups, and a random sample of clusters is selected
- All individuals within the selected clusters are included in the sample
 
Why use it: 
- Can be more practical and cost-effective when it's difficult or impractical to sample individuals directly
- Useful when the population is geographically dispersed 
- Useful when clusters are easier to access than individuals
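
As a rough sketch of the difference between the two designs, the base R code below draws both kinds of sample from an invented population of 60 rows. The regions `A`, `B`, `C`, the ZIP labels, and the 10% sampling fraction are all assumptions for illustration.

```r
set.seed(42)
# Toy population: 3 regions (strata) and 6 ZIP codes (clusters); all invented.
pop <- data.frame(
  id     = 1:60,
  region = rep(c("A", "B", "C"), times = c(30, 20, 10)),
  zip    = rep(sprintf("Z%d", 1:6), each = 10)
)

# Stratified: draw 10% of the rows independently within each region.
strat_rows <- unlist(lapply(split(pop$id, pop$region), function(ids) {
  ids[sample.int(length(ids), size = round(0.10 * length(ids)))]
}))
strat_sample <- pop[pop$id %in% strat_rows, ]

# Cluster: pick 2 ZIP codes at random and keep every row inside them.
chosen_zips <- sample(unique(pop$zip), size = 2)
clust_sample <- pop[pop$zip %in% chosen_zips, ]

table(strat_sample$region)
nrow(clust_sample)
```

Every region appears in the stratified sample, whereas the cluster sample contains only the rows from the two selected ZIP codes.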

---

<h2 align="center"> Sampling Visual </h2>

The figure below illustrates how each sampling method selects units.

![](SamplingPlans.png)

---

<h2 align="center"> Loan Default Rates </h2>

Here we find the loan default rates by region, where the regions are defined by the stratification variable FederalReserveRegion. The loan default status is defined by the variable MIS_Status.

|FederalReserveRegion | NoDefault| Default| DefaultRate (%)|
|:--------------------|---------:|-------:|---------------:|
|Atlanta              |     64135|   24604|            27.7|
|Boston               |     46260|    9476|            17.0|
|Chicago              |     71433|   17331|            19.5|
|Dallas               |     41003|    7529|            15.5|
|Kansas City          |     93528|   18230|            16.3|
|New York             |     26257|    5939|            18.4|
|Other                |      3588|     352|             8.9|
|Philadelphia         |      9893|     929|             8.6|
|Richmond             |     30785|   14272|            31.7|
|San Francisco        |     62159|   19218|            23.6|
|Seattle              |     15154|    2883|            16.0|
|St. Louis            |     11729|    1812|            13.4|
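
The rate column is the number of defaults divided by all loans in the region, expressed as a percentage. A minimal sketch using three rows of counts from the table (the data frame name `counts` is hypothetical):

```r
# Counts copied from the table above (three regions shown for brevity).
counts <- data.frame(
  FederalReserveRegion = c("Atlanta", "Boston", "Richmond"),
  NoDefault            = c(64135, 46260, 30785),
  Default              = c(24604, 9476, 14272)
)

# Default rate (%) = defaults / all loans in the region, rounded to 1 decimal.
counts$DefaultRate <- round(
  100 * counts$Default / (counts$NoDefault + counts$Default), 1
)

counts
```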

---

<h2 align="center"> Study Population </h2>

Here the categories "Other" and "Philadelphia" are removed because they are much smaller than the other categories and their default rates (8.9% and 8.6%) are well below the rest, which could create issues in the later comparisons.

| Atlanta| Boston| Chicago| Dallas| Kansas City| New York| Richmond| San Francisco| Seattle| St. Louis|
|-------:|------:|-------:|------:|-----------:|--------:|--------:|-------------:|-------:|---------:|
|   88739|  55736|   88764|  48532|      111758|    32196|    45057|         81377|   18037|     13541|

---

<h2 align="center"> Simple Random Sample </h2>

Here we take a simple random sample of 4000 loan applications from the study population. The sample keeps all 31 variables (Var.count).

| Size| Var.count|
|----:|---------:|
| 4000|        31|

---

<h2 align="center"> Systematic Sample </h2>

Similarly, we take a systematic sample, which gives a sample size of 4026 with the same 31 variables.

| Size| Var.count|
|----:|---------:|
| 4026|        31|
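
A systematic sample rarely lands on the target size exactly, because the interval has to be a whole number. The sketch below assumes a target of 4000 and the interval rule `k = floor(N / n_target)`; neither is stated in the slides. `N` is the sum of the regional counts in the study population.

```r
N <- 583737          # study population size (sum of the regional counts)
n_target <- 4000     # assumed target, matching the simple random sample

k <- floor(N / n_target)   # sampling interval (assumed rounding rule)

# Sample size for every possible random start 1..k.
sizes <- sapply(seq_len(k), function(start) length(seq(start, N, by = k)))

k
range(sizes)
```

Under these assumptions the realized size depends only on the random start, and the 4026 in the table is one of the sizes this rule can produce.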

---

<h2 align="center"> Stratified Sample </h2>

Here we take a stratified sample using the Federal Reserve region variable that was added to this data set. The table shows how a target sample of 4000 is allocated across the regions; because each stratum count is rounded, the counts add up to 4002.

| Atlanta| Boston| Chicago| Dallas| Kansas City| New York| Richmond| San Francisco| Seattle| St. Louis|
|-------:|------:|-------:|------:|-----------:|--------:|--------:|-------------:|-------:|---------:|
|     608|    382|     608|    333|         766|      221|      309|           558|     124|        93|
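
The counts above are consistent with proportional allocation, where each region receives its population share of the target sample. The sketch below assumes a target of 4000 and simple rounding with `round()`; the actual code behind the slide may differ.

```r
# Stratum sizes from the study population table.
N_h <- c(Atlanta = 88739, Boston = 55736, Chicago = 88764, Dallas = 48532,
         `Kansas City` = 111758, `New York` = 32196, Richmond = 45057,
         `San Francisco` = 81377, Seattle = 18037, `St. Louis` = 13541)

n_target <- 4000   # assumed target sample size

# Proportional allocation: each stratum gets its population share of n_target.
n_h <- round(n_target * N_h / sum(N_h))

n_h
sum(n_h)
```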

```
## 'data.frame':    583737 obs. of  31 variables:
##  $ VAR1                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ LoanNr_ChkDgt       : num  1e+09 1e+09 1e+09 1e+09 1e+09 ...
##  $ Name                : chr  "ABC HOBBYCRAFT" "LANDMARK BAR & GRILLE (THE)" "WHITLOCK DDS, TODD M." "BIG BUCKS PAWN & JEWELRY, LLC" ...
##  $ City                : chr  "EVANSVILLE" "NEW PARIS" "BLOOMINGTON" "BROKEN ARROW" ...
##  $ State               : chr  "IN" "IN" "IN" "OK" ...
##  $ Zip                 : int  47711 46526 47401 74012 32801 6062 7083 34491 32456 6073 ...
##  $ Bank                : chr  "FIFTH THIRD BANK" "1ST SOURCE BANK" "GRANT COUNTY STATE BANK" "1ST NATL BK & TR CO OF BROKEN" ...
##  $ BankState           : chr  "OH" "IN" "IN" "OK" ...
##  $ NAICS               : int  451120 722410 621210 0 0 332721 0 811118 721310 0 ...
##  $ ApprovalDate        : chr  "28-FEB-1997" "28-FEB-1997" "28-FEB-1997" "28-FEB-1997" ...
##  $ ApprovalFY          : int  1997 1997 1997 1997 1997 1997 1980 1997 1997 1997 ...
##  $ Term                : int  84 60 180 60 240 120 45 84 297 84 ...
##  $ NoEmp               : int  4 2 7 2 14 19 45 1 2 3 ...
##  $ NewExist            : int  2 2 1 1 1 1 2 2 2 2 ...
##  $ CreateJob           : int  0 0 0 0 7 0 0 0 0 0 ...
##  $ RetainedJob         : int  0 0 0 0 7 0 0 0 0 0 ...
##  $ FranchiseCode       : int  1 1 1 1 1 1 0 1 1 1 ...
##  $ UrbanRural          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RevLineCr           : chr  "N" "N" "N" "N" ...
##  $ LowDoc              : chr  "Y" "Y" "N" "Y" ...
##  $ ChgOffDate          : chr  "" "" "" "" ...
##  $ DisbursementDate    : chr  "28-FEB-1999" "31-MAY-1997" "31-DEC-1997" "30-JUN-1997" ...
##  $ MIS_Status          : chr  "P I F" "P I F" "P I F" "P I F" ...
##  $ nDisbursementGross  : int  60000 40000 287000 35000 229000 517000 600000 45000 305000 70000 ...
##  $ nBalanceGross       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nChgOffPrinGr       : int  0 0 0 0 0 0 208959 0 0 0 ...
##  $ nGrAppv             : int  60000 40000 287000 35000 229000 517000 600000 45000 305000 70000 ...
##  $ nSBA_Appv           : int  48000 32000 215250 28000 229000 387750 499998 36000 228750 56000 ...
##  $ FederalReserveRegion: chr  "Kansas City" "Chicago" "Chicago" "Dallas" ...
##  $ DefaultStatus       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ sampling.frame      : int  1 2 3 4 5 6 7 8 9 10 ...
```

|     | VAR1| LoanNr_ChkDgt|Name                           |City          |State |   Zip|Bank                       |BankState |  NAICS|ApprovalDate | ApprovalFY| Term| NoEmp| NewExist| CreateJob| RetainedJob| FranchiseCode| UrbanRural|RevLineCr |LowDoc |ChgOffDate |DisbursementDate |MIS_Status | nDisbursementGross| nBalanceGross| nChgOffPrinGr| nGrAppv| nSBA_Appv|FederalReserveRegion | DefaultStatus| sampling.frame| add.id|
|:----|----:|-------------:|:------------------------------|:-------------|:-----|-----:|:--------------------------|:---------|------:|:------------|----------:|----:|-----:|--------:|---------:|-----------:|-------------:|----------:|:---------|:------|:----------|:----------------|:----------|------------------:|-------------:|-------------:|-------:|---------:|:--------------------|-------------:|--------------:|------:|
|2420 | 2429|    1016255006|Jacob The Aramaic Holdings, In |CLEARWATER    |FL    | 33764|BANK OF AMERICA NATL ASSOC |NC        | 445120|22-JUN-2004  |       2004|   29|     3|        2|         0|           0|             1|          1|N         |N      |20-Feb-09  |30-JUN-2004      |CHGOFF     |              10000|             0|          7931|   10000|      5000|Atlanta              |             1|           2353|    319|
|2458 | 2467|    1016465010|Consolidated Mortgage Group In |O FALLON      |IL    | 62269|BANK OF AMERICA NATL ASSOC |NC        | 522310|22-JUN-2004  |       2004|   84|     5|        1|         0|           0|             1|          1|Y         |N      |           |30-JUN-2007      |P I F      |              25000|             0|             0|   25000|     12500|Atlanta              |             1|           2389|    335|
|2597 | 2606|    1017355003|Linda Binegar d/b/a The Wooden |SMITHFIELD    |NC    | 27577|BANK OF AMERICA NATL ASSOC |NC        | 722211|24-JUN-2004  |       2004|   46|     1|        2|         0|           0|             1|          1|T         |N      |6-Mar-08   |31-OCT-2004      |CHGOFF     |              10000|             0|         10000|   10000|      5000|Atlanta              |             1|           2527|    395|
|2620 | 2629|    1017485007|Valerie Johnson Designs, Inc.  |ATHENS        |GA    | 30606|BANK OF AMERICA NATL ASSOC |NC        | 423990|25-JUN-2004  |       2004|   84|     1|        2|         0|           0|             1|          1|Y         |N      |           |31-JUL-2004      |P I F      |              17200|             0|             0|   10000|      5000|Atlanta              |             1|           2550|    404|
|3582 | 3594|    1024035010|Team Sales of North Georgia, I |JASPER        |GA    | 30143|BANK OF AMERICA NATL ASSOC |NC        | 323113|15-JUL-2004  |       2004|    2|     3|        2|         0|           0|             1|          1|Y         |N      |8-Jun-11   |31-JUL-2004      |CHGOFF     |              13532|             0|          2204|   10000|      5000|Atlanta              |             1|           3486|    799|
|4067 | 4080|    1027785007|J&J Autoparts, Inc.            |MIAMI SPRINGS |FL    | 33166|BANK OF AMERICA NATL ASSOC |NC        | 441310|26-JUL-2004  |       2004|   24|     3|        1|         0|           0|             1|          2|Y         |N      |30-Dec-09  |30-NOV-2004      |CHGOFF     |              91419|             0|         27612|   50000|     25000|Atlanta              |             1|           3966|   1018|

---

<h2 align="center"> Cluster Sample  </h2>

Here we use the cluster sampling method, with ZIP codes breaking the data into clusters. The design output shows that this creates 29880 clusters.

```
## 1 - level Cluster Sampling design (with replacement)
## With (29880) clusters.
## svydesign(ids = ~Zip, data = loan2)
```

```
##           mean     SE
## nGrAppv 191544 893.47
```

### Verifying

To verify this, we compare the 29880 clusters with the number of unique ZIP codes in this data set.

```
## [1] 29880
```
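
A check of this kind can be written in base R as below. The data frame here is a tiny invented stand-in that reuses the names `loan2`, `Zip`, and `nGrAppv` from the output above; it is not the real data.

```r
# Toy stand-in for loan2; the ZIP codes and amounts are invented.
loan2 <- data.frame(
  Zip     = c(47711, 47711, 46526, 74012, 74012, 74012),
  nGrAppv = c(60000, 40000, 287000, 35000, 229000, 517000)
)

# Number of clusters = number of distinct ZIP codes.
n_clusters <- length(unique(loan2$Zip))

# Cluster sizes, useful for spotting very small or very large clusters.
cluster_sizes <- table(loan2$Zip)

n_clusters
```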

---

<h2 align="center"> Default Rates  </h2>

|              | default.rate.pop| default.rate.srs| default.rate.sys| default.rate.str|
|:-------------|----------------:|----------------:|----------------:|----------------:|
|Atlanta       |             27.7|             28.7|             27.5|             27.0|
|Boston        |             17.0|             14.6|             15.6|             17.5|
|Chicago       |             19.5|             19.7|             16.9|             17.6|
|Dallas        |             15.5|             13.4|             17.2|             12.3|
|Kansas City   |             16.3|             14.9|             18.3|             15.0|
|New York      |             18.4|             17.0|             17.6|             15.4|
|Richmond      |             31.7|             32.7|             29.3|             31.1|
|San Francisco |             23.6|             22.0|             23.8|             22.2|
|Seattle       |             16.0|             14.9|             19.0|             12.9|
|St. Louis     |             13.4|             15.2|             18.4|             15.1|
---

<h2 align="center"> Plotting the Default Rates </h2>

---

<h2 align="center"> Final Thoughts </h2>

After reviewing the data and the different sampling methods, I think that although the sampling methods are all effective, the stratified sample and the simple random sample give default rates closer to those of the total population overall. We can see in the graphic that the stratified sample follows the population line more closely than the other sampling methods. For this reason, and because SRS leaves the regional proportions up to chance, I think that a stratified sample would be the most effective to use in our loan study.
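
One possible way to put a number on "closer" is the mean absolute deviation between each sample's regional default rates and the population rates. The sketch below uses the values from the Default Rates table; the choice of metric and the column name `strat` are assumptions, and each column reflects a single draw, so the ranking could change with a different sample.

```r
# Default rates (%) copied from the Default Rates table, in region order.
rates <- data.frame(
  pop   = c(27.7, 17.0, 19.5, 15.5, 16.3, 18.4, 31.7, 23.6, 16.0, 13.4),
  srs   = c(28.7, 14.6, 19.7, 13.4, 14.9, 17.0, 32.7, 22.0, 14.9, 15.2),
  sys   = c(27.5, 15.6, 16.9, 17.2, 18.3, 17.6, 29.3, 23.8, 19.0, 18.4),
  strat = c(27.0, 17.5, 17.6, 12.3, 15.0, 15.4, 31.1, 22.2, 12.9, 15.1)
)

# Mean absolute deviation from the population rate, in percentage points.
mad_pp <- sapply(rates[c("srs", "sys", "strat")],
                 function(x) round(mean(abs(x - rates$pop)), 2))

mad_pp
```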

---

<h2 align="center"> Thank you! </h2>