Comparing Sampling Methods

Introduction

The data for this analysis contains information on bank loans issued in the United States. The goal is to see which sampling method produces a sample that is most faithful to the original population. I test simple random, systematic, and stratified sampling. To start, I re-categorized the variable State so that it has only 7 outcomes instead of 50, grouping the states by region as shown in the map below.

Methods

First, after combining the states into regions, I made sure that no region contained too small a share of the population. After finding that none of the categories contains less than 10% of the population, I continued with my analysis.

Population.size	Number.of.Regions	Sub.Pop.less.1000
899164	7	0
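
The size check summarized in the table above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the population here is synthetic (50,000 rows assigned to regions at random), and the function name `summarize_regions` and its `min_size` parameter are assumptions introduced for the example.

```python
import random
from collections import Counter

random.seed(0)

REGIONS = ["MidAtlantic", "Midwest", "Northeast", "Northwest",
           "Southeast", "Southwest", "West"]

# Synthetic stand-in for the Region column (NOT the loan data).
population_regions = [random.choice(REGIONS) for _ in range(50_000)]


def summarize_regions(regions, min_size=1000):
    """Count rows per region and flag regions smaller than min_size."""
    counts = Counter(regions)
    return {
        "Population.size": len(regions),
        "Number.of.Regions": len(counts),
        # Number of regions with fewer than min_size rows.
        "Sub.Pop.less.1000": sum(1 for n in counts.values() if n < min_size),
    }


summary = summarize_regions(population_regions)
print(summary)
```

A value of 0 in the last field means every region is large enough to sample from.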

Simple Random Sampling

Next, I draw a sample by simple random sampling. I want 4,000 random observations from the population, and I print a table of the new sample to confirm that it contains 4,000 observations.

Size	Var.count
4000	28
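
As a minimal sketch of simple random sampling, the Python snippet below draws 4,000 row indices without replacement from a population of 899,164 rows (the size reported earlier). It samples index positions only; the variable names and the fixed seed are assumptions for the example, not part of the original analysis.

```python
import random

random.seed(42)

population_size = 899_164  # population size reported in the Methods table
sample_size = 4_000

# random.sample draws without replacement, so no row index is repeated
# and every row has the same chance of being selected.
srs_indices = random.sample(range(population_size), sample_size)

print(len(srs_indices))
```

The selected indices would then be used to pull the matching rows from the loan data.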

Systematic Sampling

For systematic sampling, I also asked for about 4,000 samples from the population. Because systematic sampling takes every k-th observation after a random starting point, the sample can contain a few observations more or fewer than requested. I printed another table to show how many observations were pulled from the population.

Size	Var.count
4014	28
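
The reason the count is not exactly 4,000 can be seen in a short Python sketch. It assumes the sampling interval is the population size divided by the target size, rounded down, and that the starting row is drawn at random within the first interval; the original analysis may have chosen the interval differently. The function name `systematic_indices` is hypothetical.

```python
import random


def systematic_indices(population_size, target_size, rng):
    """Take every jump-th row index after a random start."""
    jump = population_size // target_size  # interval, rounded down (assumption)
    start = rng.randrange(jump)            # random start in the first interval
    return list(range(start, population_size, jump)), jump


rng = random.Random(1)
indices, jump = systematic_indices(899_164, 4_000, rng)

print(jump, len(indices))
```

Under this assumption the interval is 224, and since 224 × 4,014 = 899,136 leaves 28 rows over, the sample has either 4,014 or 4,015 observations depending on the starting row. That is consistent with the 4,014 reported above.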

Stratified Sampling

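For the stratified sampling technique, I again asked for 4,000 samples from the population and printed a table showing how many samples ended up in each category. The table shows 571 observations in every region (3,997 in total), which is an equal allocation across regions rather than one proportional to how large the categories are in the population.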

MidAtlantic	Midwest	Northeast	Northwest	Southeast	Southwest	West
571	571	571	571	571	571	571
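
The counts above match an equal allocation, since 4,000 divided by 7 regions and rounded down is 571. A minimal Python sketch of stratified sampling with equal allocation follows. The population is synthetic and the function name `stratified_equal` is hypothetical; a proportional allocation would instead set each stratum's size from its share of the population.

```python
import random
from collections import defaultdict

random.seed(7)

REGIONS = ["MidAtlantic", "Midwest", "Northeast", "Northwest",
           "Southeast", "Southwest", "West"]

# Synthetic population of (row_id, region) pairs (NOT the loan data).
population = [(i, random.choice(REGIONS)) for i in range(50_000)]


def stratified_equal(population, total_size, rng):
    """Draw the same number of rows, without replacement, from each region."""
    by_region = defaultdict(list)
    for row_id, region in population:
        by_region[region].append(row_id)
    per_stratum = total_size // len(by_region)  # 4000 // 7 == 571
    return {region: rng.sample(ids, per_stratum)
            for region, ids in by_region.items()}


sample = stratified_equal(population, 4_000, random.Random(7))
sizes = {region: len(ids) for region, ids in sample.items()}
print(sizes)
```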

Results

First, I found the default rates for the population and then compared them to the default rates of the simple random sample, the systematic sample, and the stratified sample. Based on the table, the default rates for the systematic sample are the closest to the population default rates, making systematic sampling the best fit for this data.

Comparison of region-specific default rates (%) between the population, the simple random sample (SRS), the systematic sample, and the stratified sample.
	default.rate.pop	default.rate.srs	default.rate.sys	default.rate.str
Northwest	17.5	17.9	17.8	50
Midwest	17.7	17.2	17.8	50
MidAtlantic	17.6	15.1	17.8	50
Northeast	17.6	20.0	17.8	50
Southeast	17.6	16.9	17.8	50
West	17.4	18.4	17.8	50
Southwest	17.5	15.6	17.8	50
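
One way to put a number on "closest to the population" is the mean absolute difference between each sample's regional default rates and the population's. The Python sketch below uses the values from the table above; the choice of this metric is an assumption made for illustration, not something the original analysis computed.

```python
# Default rates (%) copied from the table above, in row order:
# Northwest, Midwest, MidAtlantic, Northeast, Southeast, West, Southwest.
pop = [17.5, 17.7, 17.6, 17.6, 17.6, 17.4, 17.5]
samples = {
    "srs": [17.9, 17.2, 15.1, 20.0, 16.9, 18.4, 15.6],
    "sys": [17.8] * 7,
    "str": [50.0] * 7,
}


def mean_abs_diff(sample_rates, pop_rates):
    """Average absolute gap, in percentage points, across regions."""
    gaps = [abs(s - p) for s, p in zip(sample_rates, pop_rates)]
    return sum(gaps) / len(gaps)


mad = {name: mean_abs_diff(rates, pop) for name, rates in samples.items()}
best = min(mad, key=mad.get)

for name, value in mad.items():
    print(f"{name}: {value:.2f}")
print("closest to population:", best)
```

By this measure the systematic sample has the smallest average gap, followed by the simple random sample, with the stratified sample far behind.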

Results in a Graph

After putting the output from the table into a graph, it is easy to see that the systematic sample follows the population most closely, making it the best fit for this data.

Assignment #4

Mikaela Taylor

10/6/2022