The data for this analysis contains information on bank loans issued in the United States. The goal is to see which sampling method produces a sample that is most representative of the original population. I compare simple random, systematic, and stratified sampling. To start, I re-categorized the variable State so that it has only 7 categories instead of 50. I grouped the states by region, following the map below.
First, after combining the states, I made sure that none of the regions contained too small a share of the population. After finding that none of the regions contains less than 10% of the population, I continued with my analysis.
| Population.size | Number.of.Regions | Sub.Pop.less.1000 |
|---|---|---|
| 899164 | 7 | 0 |
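
The `Sub.Pop.less.1000` column suggests a count of regions with fewer than 1,000 observations. A minimal sketch of such a check is shown here, assuming each loan record carries a region label. The function name `small_regions` and the toy data are hypothetical and are not the code used for the table above.

```python
from collections import Counter

def small_regions(regions, min_count=1000):
    """Return the regions whose observation count falls below min_count."""
    counts = Counter(regions)
    return {name: n for name, n in counts.items() if n < min_count}

# Toy data (hypothetical): one region label per loan record.
toy_regions = ["West"] * 1500 + ["Midwest"] * 1200 + ["Northeast"] * 800

print(small_regions(toy_regions))       # {'Northeast': 800}
print(len(small_regions(toy_regions)))  # 1 region is below the threshold
```

An empty result corresponds to the 0 reported in the table.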
Next, I draw a simple random sample. I want 4,000 random observations from the population, and I print a table of the new sample to confirm that it contains 4,000 observations.
| Size | Var.count |
|---|---|
| 4000 | 28 |
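
As an illustration of the idea, a simple random sample gives every record the same chance of selection and draws without replacement. The sketch below assumes the population can be addressed by row index. The helper name `simple_random_sample` and the seed are hypothetical choices, not the original code.

```python
import random

def simple_random_sample(records, n, seed=None):
    """Draw n records without replacement, each equally likely."""
    rng = random.Random(seed)
    return rng.sample(records, n)

# Row indices stand in for the 899,164 loan records.
population = range(899164)
srs = simple_random_sample(population, 4000, seed=1)

print(len(srs))       # 4000
print(len(set(srs)))  # 4000, so no record was drawn twice
```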
For systematic sampling, I also asked for about 4,000 observations from the population, but because of the nature of systematic sampling, the sample may contain a few observations more or fewer than requested. I printed another table to show how many observations were pulled from the population.
| Size | Var.count |
|---|---|
| 4014 | 28 |
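
The sketch below shows one way the sample size can drift from the target. It assumes the jump size is the population size divided by the target size, rounded down, with a random starting row. The exact rule used in this analysis is not stated, so treat the function `systematic_sample_indices` as a hypothetical illustration.

```python
import random

def systematic_sample_indices(pop_size, target_n, seed=None):
    """Pick every k-th row after a random start, with k = pop_size // target_n."""
    k = pop_size // target_n   # jump size, rounded down
    rng = random.Random(seed)
    start = rng.randrange(k)   # random start in 0..k-1
    return list(range(start, pop_size, k))

N = 899164
k = N // 4000
print(k)  # 224

# Sample size for every possible starting row.
sizes = {len(range(start, N, k)) for start in range(k)}
print(sorted(sizes))  # [4014, 4015]
```

Under this assumption, rounding the jump size down to 224 yields 4,014 or 4,015 rows depending on the start, which is consistent with the 4,014 reported above.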
For the stratified sampling technique, I again asked for 4,000 observations from the population and printed a table showing how many ended up in each region. The table shows an equal allocation of 571 observations per region (3,997 in total), rather than counts proportional to each region's size in the population.
| MidAtlantic | Midwest | Northeast | Northwest | Southeast | Southwest | West |
|---|---|---|---|---|---|---|
| 571 | 571 | 571 | 571 | 571 | 571 | 571 |
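
The count of 571 per region matches 4,000 divided by 7 and rounded down. A minimal sketch of an equal-allocation stratified sample follows, assuming each record carries its region. The function `stratified_sample_equal` and the toy records are hypothetical and stand in for the real data.

```python
import random
from collections import defaultdict

def stratified_sample_equal(records, stratum_of, total_n, seed=None):
    """Draw total_n // (number of strata) records from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    per_stratum = total_n // len(strata)
    # Each stratum must hold at least per_stratum records.
    return {name: rng.sample(rows, per_stratum) for name, rows in strata.items()}

regions = ["MidAtlantic", "Midwest", "Northeast", "Northwest",
           "Southeast", "Southwest", "West"]
# Toy records (hypothetical): (row id, region), 1,000 rows per region.
toy = [(i, regions[i % 7]) for i in range(7000)]

sample = stratified_sample_equal(toy, lambda rec: rec[1], 4000, seed=1)
sizes = {name: len(rows) for name, rows in sample.items()}
print(sizes["West"])        # 571
print(sum(sizes.values()))  # 3997
```

A proportional allocation would instead give each region a share of the 4,000 equal to its share of the population.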
Next, I found the default rates for the population and compared them to the default rates of the simple random sample, the systematic sample, and the stratified sample. Based on the table, the default rate for the systematic sample is the closest to the population default rate in every region (within 0.4 percentage points), making it the best fit for this data.
| Region | default.rate.pop | default.rate.srs | default.rate.sys | default.rate.str |
|---|---|---|---|---|
| Northwest | 17.5 | 17.9 | 17.8 | 50 |
| Midwest | 17.7 | 17.2 | 17.8 | 50 |
| MidAtlantic | 17.6 | 15.1 | 17.8 | 50 |
| Northeast | 17.6 | 20.0 | 17.8 | 50 |
| Southeast | 17.6 | 16.9 | 17.8 | 50 |
| West | 17.4 | 18.4 | 17.8 | 50 |
| Southwest | 17.5 | 15.6 | 17.8 | 50 |
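
To make the comparison concrete, the largest gap between each sample's default rate and the population rate can be computed directly from the table values. The sketch below only re-uses the numbers printed above. The variable names are my own.

```python
# Default rates from the table: (population, srs, systematic, stratified).
rates = {
    "Northwest":   (17.5, 17.9, 17.8, 50.0),
    "Midwest":     (17.7, 17.2, 17.8, 50.0),
    "MidAtlantic": (17.6, 15.1, 17.8, 50.0),
    "Northeast":   (17.6, 20.0, 17.8, 50.0),
    "Southeast":   (17.6, 16.9, 17.8, 50.0),
    "West":        (17.4, 18.4, 17.8, 50.0),
    "Southwest":   (17.5, 15.6, 17.8, 50.0),
}

methods = ("srs", "sys", "str")
max_gap = {}
for i, method in enumerate(methods, start=1):
    # Largest absolute difference from the population rate across regions.
    gaps = [abs(row[i] - row[0]) for row in rates.values()]
    max_gap[method] = round(max(gaps), 1)

print(max_gap)  # {'srs': 2.5, 'sys': 0.4, 'str': 32.6}
```

The systematic sample has the smallest maximum gap, while the stratified column, at 50 in every region, is far from the population rates.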
After putting the output from the table into a graph, it's easy to see that the systematic sample follows the population most closely, making it the best fit for this data.