Comparing Sampling Methods

Introduction

The data for this analysis describe bank loans issued in the United States. The goal is to see which sampling method produces a sample that is most faithful to the original population. I compare simple random, systematic, and stratified sampling. To start, I recoded the State variable by region, so that it takes 7 values instead of 50. The regional grouping follows the map below.

Methods

First, after combining the states into regions, I checked that no region held too small a share of the population. None of the regions contains less than 10% of the population, so I continued with the analysis.

Population.size Number.of.Regions Sub.Pop.less.1000
899164 7 0
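The check summarized in the table above can be sketched as follows. The column name Sub.Pop.less.1000 suggests that it counts regions with fewer than 1,000 observations; that reading, the function name, and the toy data are assumptions made for illustration, not the original analysis code.

```python
from collections import Counter

def small_region_summary(regions, threshold=1000):
    """Summarize a list of region labels (one label per loan record)."""
    counts = Counter(regions)
    return {
        "Population.size": len(regions),
        "Number.of.Regions": len(counts),
        # number of regions with fewer than `threshold` observations
        "Sub.Pop.less.1000": sum(1 for n in counts.values() if n < threshold),
    }

# Toy labels only; the real data has 899,164 records in 7 regions.
toy_regions = ["West"] * 1500 + ["Midwest"] * 1200 + ["Northeast"] * 300
summary = small_region_summary(toy_regions)
print(summary)
```

A value of 0 in the last field, as in the table above, means no region falls below the threshold.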

Simple Random Sampling

Next, I draw a simple random sample. I want 4,000 randomly chosen observations from the population, and I print a table of the new sample to confirm that it contains 4,000 observations.

Size Var.count
4000 28
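A minimal sketch of simple random sampling, assuming the population is addressed by row index 0 to 899,163. The function name and the seed are hypothetical; only the population size and sample size come from the tables above.

```python
import random

def simple_random_sample(pop_size, n, seed=None):
    """Return n distinct row indices, each subset of size n equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no row is picked twice
    return rng.sample(range(pop_size), n)

srs_idx = simple_random_sample(899164, 4000, seed=42)
print(len(srs_idx))
```

Because the draw is without replacement, the sample size is always exactly the requested 4,000.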

Systematic Sampling

For systematic sampling, I also asked for about 4,000 observations. Because systematic sampling takes every k-th observation at a fixed interval, the sample can come out a few observations larger or smaller than requested. I printed another table to show how many observations were pulled from the population.

Size Var.count
4014 28
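The sample size of 4,014 can be reproduced with a short sketch. It assumes the sampling interval is the population size divided by the target size, rounded down, and that the starting row is chosen at random within the first interval; the original analysis may have used a different rule. The function name and the optional start argument are hypothetical.

```python
import random

def systematic_indices(pop_size, target, start=None, seed=None):
    """Return every step-th row index, beginning at a start row below step."""
    step = pop_size // target          # 899164 // 4000 = 224
    if start is None:
        start = random.Random(seed).randrange(step)
    return list(range(start, pop_size, step))

N = 899164
print(N // 4000)                                    # interval
print(len(systematic_indices(N, 4000, start=0)))    # earliest possible start
print(len(systematic_indices(N, 4000, start=223)))  # latest possible start
```

Since 899,164 = 224 × 4,014 + 28, a start row below 28 yields 4,015 observations and any later start yields 4,014, which is why the sample is slightly larger than the 4,000 requested.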

Stratified Sampling

For stratified sampling, I again asked for 4,000 observations and printed a table showing how many ended up in each region. The table shows an equal allocation of 571 observations per region (3,997 in total) rather than counts proportional to each region's share of the population.

MidAtlantic Midwest Northeast Northwest Southeast Southwest West
571 571 571 571 571 571 571
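The table shows 571 observations in every region, which is what an equal split of 4,000 across 7 regions gives after rounding down (4000 // 7 = 571). A minimal sketch of that allocation follows; the record layout, function name, and toy data are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_equal(records, key, total, seed=None):
    """Draw the same number of records from each stratum, without replacement."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    per_stratum = total // len(strata)      # 4000 // 7 = 571
    sample = []
    for name in sorted(strata):
        rows = strata[name]
        # never ask for more rows than the stratum holds
        sample.extend(rng.sample(rows, min(per_stratum, len(rows))))
    return sample

regions = ["MidAtlantic", "Midwest", "Northeast", "Northwest",
           "Southeast", "Southwest", "West"]
# Toy population: (region, loan id) pairs, 1,000 per region.
toy = [(r, i) for r in regions for i in range(1000)]
strat = stratified_equal(toy, key=lambda rec: rec[0], total=4000, seed=7)
print(Counter(rec[0] for rec in strat))
```

Rounding down in each stratum is why the total is 3,997 rather than 4,000.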

Results

First, I found the default rates for the population and compared them to the default rates of the simple random sample, the systematic sample, and the stratified sample. Based on the table, the systematic sample's default rates are the closest to the population default rates, making systematic sampling the best fit for this data.

Comparison of region-specific default rates (%) between the population, SRS, systematic sample, and stratified sample.
Region       default.rate.pop  default.rate.srs  default.rate.sys  default.rate.str
Northwest    17.5              17.9              17.8              50
Midwest      17.7              17.2              17.8              50
MidAtlantic  17.6              15.1              17.8              50
Northeast    17.6              20.0              17.8              50
Southeast    17.6              16.9              17.8              50
West         17.4              18.4              17.8              50
Southwest    17.5              15.6              17.8              50
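One way to make "closest" concrete is the mean absolute difference between each sample's regional default rates and the population's. The choice of this metric is an assumption made for illustration; the numbers are copied from the table above.

```python
# Default rates by region, in the table's row order:
# Northwest, Midwest, MidAtlantic, Northeast, Southeast, West, Southwest
pop = [17.5, 17.7, 17.6, 17.6, 17.6, 17.4, 17.5]
samples = {
    "srs": [17.9, 17.2, 15.1, 20.0, 16.9, 18.4, 15.6],
    "sys": [17.8] * 7,
    "str": [50.0] * 7,
}

def mean_abs_diff(sample_rates, pop_rates):
    """Average absolute gap between sample and population rates."""
    gaps = [abs(s - p) for s, p in zip(sample_rates, pop_rates)]
    return sum(gaps) / len(gaps)

scores = {name: mean_abs_diff(rates, pop) for name, rates in samples.items()}
closest = min(scores, key=scores.get)
for name, score in scores.items():
    print(name, round(score, 2))
print("closest:", closest)
```

By this measure the systematic sample has the smallest average gap (about 0.24 percentage points), followed by the simple random sample (about 1.34) and the stratified sample (about 32.44).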

Results in a Graph

After putting the output from the table into a graph, it is easy to see that the systematic sample follows the population most closely, making it the best fit for this data.