1 Table of Contents

  1. Introduction
  2. Pre-Analysis
  3. Simple Random Sampling
  4. Systematic Sampling
  5. Stratified Sampling
  6. Results
  7. Results in a Graph

2 Introduction

The goal of this analysis is to determine which sampling method produces a sample that stays closest to the original population. The data contain information on bank loans provided to small businesses in America between 1987 and 2014. There are 899,164 observations, and the variables range from the amount of money disbursed in the loan to the number of employees at the small business receiving the loan. The methods tested are simple random, systematic, and stratified sampling. For all of these samples, the variable “State” is re-categorized so that there are only 7 categories instead of 50. The states are grouped by region, following the commonly accepted regional divisions shown in the map below. This re-categorization reduces the number of distinct values of the State variable, which makes for a more logical and concise conclusion.

US Map

3 Pre-Analysis

The first step, after combining the states into regions, is to make sure that no region contains too small a share of the population. After confirming that no region contains less than 10% of the population (the table below also reports zero regions with fewer than 1,000 observations), the analysis continues with simple random sampling, systematic sampling, and stratified sampling.

Population.size  Number.of.Regions  Sub.Pop.less.1000
899164           7                  0
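
The pre-analysis check can be sketched as follows. The report's analysis was done in R; this Python version is only an illustration, and the state-to-region lookup, the function names, and the toy data are all assumptions (the report's actual grouping follows its map and is not reproduced here).

```python
from collections import Counter

# Hypothetical, partial state-to-region lookup for illustration only;
# it is not the grouping used in the report.
STATE_TO_REGION = {
    "NY": "MidAtlantic", "PA": "MidAtlantic",
    "OH": "Midwest", "IL": "Midwest",
    "MA": "Northeast", "ME": "Northeast",
    "WA": "Northwest", "OR": "Northwest",
    "FL": "Southeast", "GA": "Southeast",
    "TX": "Southwest", "AZ": "Southwest",
    "CA": "West", "NV": "West",
}

def region_counts(states):
    """Count observations per region for an iterable of state codes."""
    return Counter(STATE_TO_REGION[s] for s in states)

def small_regions(counts, min_size=1000):
    """Return the regions that hold fewer than min_size observations."""
    return sorted(r for r, n in counts.items() if n < min_size)

# Toy data: 1,500 loans per listed state (made-up numbers, not the real data).
toy_states = [s for s in STATE_TO_REGION for _ in range(1500)]
counts = region_counts(toy_states)
summary = {
    "Population.size": sum(counts.values()),
    "Number.of.Regions": len(counts),
    "Sub.Pop.less.1000": len(small_regions(counts)),
}
print(summary)
```

With the toy data every region holds 3,000 observations, so the last entry of the summary is 0, mirroring the layout of the table above.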

4 Simple Random Sampling

Simple random sampling is the simplest of the three sampling methods. A subset of individuals is drawn at random from the population, with every individual having the same probability of being chosen.

Simple Random Sampling
With the help of the R language, a random generator is used to draw a sample of the analyst’s chosen size. Here 4,000 observations are drawn at random from the population. To confirm that the sample does contain 4,000 observations, a table is printed showing the sample size (Size) and the variable count (Var.count) of the new sample.

Size Var.count
4000 28
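
A minimal sketch of this draw, written in Python rather than the R used for the report. The function name, the seed, and the use of row numbers 1 to 899,164 as stand-ins for the loan records are assumptions for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct items, each equally likely, without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Row numbers stand in for the loan records (assumption for illustration).
population_ids = range(1, 899_164 + 1)
srs = simple_random_sample(population_ids, 4000, seed=1)

# Size check, analogous to the table above: 4,000 rows, no duplicates.
print(len(srs), len(set(srs)))
```

Sampling row numbers and then looking up the matching records keeps the draw cheap even for a population of this size.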

5 Systematic Sampling

Systematic sampling selects observations from a larger population at a fixed periodic interval, beginning from a randomly chosen starting point. The interval, called the jump size, determines how far to move through the population before the next observation is chosen. It is typically determined by the number of observations in the population and the desired sample size.

Systematic Sampling
For systematic sampling, about 4,000 observations were again drawn from the population. Because of the nature of systematic sampling, the sample may contain a few observations more or fewer than requested. Another table was printed to show how many observations were drawn from the population. The first observation in the sample is observation 107, which was chosen by a random generator in R. The jump size for this sample is 224, meaning that every 224th observation after the starting point is added to the sample (223 observations are skipped between consecutive selections).

Size Var.count
4014 28
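
The sample size of 4,014 follows directly from the starting point and the jump size. The sketch below is in Python rather than the report's R, and it assumes that the jump size is the population size divided by the requested sample size, rounded down, and that the start is drawn from the first interval. Both assumptions are consistent with the reported values of 224 and 107, but the report does not state them.

```python
import random

def systematic_sample(pop_size, n, start=None, seed=None):
    """Return (jump, selected row numbers) for a 1-based population."""
    jump = pop_size // n  # assumed rule: floor(N / n)
    if start is None:
        # Random start within the first interval (assumption).
        start = random.Random(seed).randint(1, jump)
    return jump, list(range(start, pop_size + 1, jump))

# Reproduce the reported setup: start at observation 107.
jump, sample_ids = systematic_sample(899_164, 4000, start=107)
print(jump, len(sample_ids))  # 224 4014
```

Because 899,164 is not an exact multiple of 4,000, rounding the jump size down to 224 leaves room for 14 extra selections, which is why the sample is slightly larger than requested.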

6 Stratified Sampling

Stratified sampling divides a population into smaller subgroups known as strata, based on members’ shared attributes or characteristics, and then samples each stratum randomly and independently. In this case the strata are defined by the region into which each state falls. This makes State the stratification variable, the variable by which the observations are categorized. State was chosen because of the interest in what impact geographical location could have on bank loans. When each stratum is sampled in proportion to its size, stratifying guarantees a proportional representation of the regions, making it easier to draw conclusions about them. It can also help highlight trends that an unstratified method such as simple random sampling might miss.

Stratified Sampling
For the stratified sampling technique, 4,000 observations were requested from the population, and a table was printed showing how many observations were drawn from each region. The table shows that the same number of observations, 571, was drawn from every region, for a total of 3,997. This equal allocation mirrors the population’s proportions only if the seven regions are of equal size in the population. As stated previously, State is the stratification variable for this sample, in the hope that the data on bank loans will be better represented.

MidAtlantic  Midwest  Northeast  Northwest  Southeast  Southwest  West
571          571      571        571        571        571        571
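
The counts in the table are what an equal split of 4,000 across 7 strata produces, since 4,000 divided by 7 and rounded down is 571. The sketch below illustrates that allocation in Python rather than the report's R. The function name, the seed, and the toy population (with deliberately unequal region sizes) are assumptions, not the report's code or data.

```python
import random
from collections import Counter, defaultdict

def stratified_equal_sample(records, key, total_n, seed=None):
    """Draw the same number of records, at random, from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    per_stratum = total_n // len(strata)  # equal allocation, rounded down
    sample = []
    for name in sorted(strata):
        sample.extend(rng.sample(strata[name], per_stratum))
    return sample

regions = ["MidAtlantic", "Midwest", "Northeast", "Northwest",
           "Southeast", "Southwest", "West"]
# Toy population with unequal region sizes (made-up numbers).
toy = [(region, i) for k, region in enumerate(regions)
       for i in range(1000 + 500 * k)]

sample = stratified_equal_sample(toy, key=lambda rec: rec[0],
                                 total_n=4000, seed=1)
alloc = Counter(region for region, _ in sample)
print(alloc["West"], len(sample))  # 571 3997
```

Every region ends up with 571 records even though the toy regions differ in size. A proportional design would instead scale each region's count by its share of the population.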

7 Results

To compare the sampling methods, the default rate of each sample is compared. A default rate is the percentage of loans that a lender has labeled as unpaid after a long period of missed payments. The default rates of each sample and of the population are compared in the table below. Based on the table, the default rate of the systematic sample is the closest to the population default rate, making it the best fit to represent this data. Notice the large default rates for the stratified sample. Because these values differ so sharply from those of the other samples, it is unlikely that the difference arose by chance. Understanding why these default rates are so much higher would require further investigation into the specific observations in this sample and into the sampling method itself.
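
The comparison described above reduces to computing a percentage for each group and finding the sample whose value lies nearest the population's. The following Python sketch (the report itself used R) shows the arithmetic. The function names, the sample labels, and the default flags are made up for illustration and are not the report's results.

```python
def default_rate(flags):
    """Percentage of loans flagged as defaulted (1) among all loans."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags)

def closest_sample(population_rate, sample_rates):
    """Name of the sample whose rate is nearest the population rate."""
    return min(sample_rates,
               key=lambda name: abs(sample_rates[name] - population_rate))

# Made-up flags for illustration only; these are not the report's data.
population = [1] * 20 + [0] * 80       # 20 of 100 loans defaulted
samples = {
    "sample_a": [1] * 3 + [0] * 7,     # 3 of 10 defaulted
    "sample_b": [1] * 2 + [0] * 8,     # 2 of 10 defaulted
    "sample_c": [1] * 5 + [0] * 5,     # 5 of 10 defaulted
}
rates = {name: default_rate(flags) for name, flags in samples.items()}
best = closest_sample(default_rate(population), rates)
print(rates, best)
```

The absolute difference is used here as the distance measure because the report ranks the methods by how close each rate is to the population rate.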

8 Results in a Graph

After plotting the output from the table, it’s easy to see that the systematic sample follows the population most closely, making it the best fit for this data. The systematic and the stratified samples both have a constant default rate, so the graph below shows no change for either of them.