Probability Sampling:
In probability sampling, often called random sampling, we randomly select a group from the population that we want to study. It is extremely important that this group is truly chosen at random so that we can make the most accurate predictions about the population; if the group is not truly random, our results can be biased. There are four types of probability sampling: simple random sampling, stratified sampling, systematic sampling, and cluster sampling. The ultimate goal is to take the findings from our small sample group and apply them to the much larger population.
Non-Probability Sampling:
Non-probability sampling covers methods that do not select the sample at random. Since selection is not random, not everyone has an equal chance of being selected for the sample and there is a higher chance of bias in the sample. There are five common types of non-probability sampling: convenience sampling, quota sampling, self-selection (volunteer) sampling, snowball sampling, and purposive (judgmental) sampling.
Differences and when we should use each:
We should use probability sampling when we want a sample that accurately represents an entire population. Taking our sample completely at random gives us the best chance of representing a large population without bias. An example of this would be asking 25 randomly selected people whether they liked a new Netflix show that recently came out.
We should use non-probability sampling when we want to sample a specific group that shares similar characteristics. With this method, it is much easier to reach the correct target audience. An example of this would be finding out whether the show Suits is an accurate representation of real life. To do so, we would sample a group of attorneys. Our sample group is then not random, but it likely provides a more expert opinion on the question.
Simple Random Sampling:
In this method we draw a random selection from the entire population, where every unit has an equal chance of selection. This is the most commonly used method for selecting a random sample. An example of this would be researching political views by surveying people drawn at random from the whole population.
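As a minimal sketch of simple random sampling in R, the block below draws 25 units from a made-up sampling frame using base R's `sample()`. The frame, its size of 1,000, and the names `population`, `srs_ids`, and `srs` are assumptions for illustration only, not part of any real dataset.

```r
# Hypothetical sampling frame of 1,000 unit IDs (illustrative, not real data)
set.seed(42)                          # fixed seed so the draw can be repeated
population <- data.frame(id = 1:1000)

# Draw 25 IDs without replacement; each unit has the same chance (25/1000)
srs_ids <- sample(population$id, size = 25, replace = FALSE)

# Keep only the rows whose ID was drawn
srs <- population[population$id %in% srs_ids, , drop = FALSE]

nrow(srs)                             # number of sampled units
```

Sampling without replacement means no unit can appear twice, which is usually what we want when each unit is a person or household.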
Systematic Sampling:
This is when we sample the target population by selecting units at regular intervals, starting from a random point. To calculate the interval, we divide the size of our sampling frame by the sample size we want. We then pick a random starting unit within the first interval and select every unit at that interval after it until we reach the total number we want for our sample.
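The interval calculation can be sketched in a few lines of base R. The frame size `N`, sample size `n`, and the variable names are hypothetical values chosen so that the interval divides evenly; a real frame may need rounding.

```r
set.seed(1)                  # fixed seed so the random start can be repeated
N <- 1000                    # hypothetical size of the sampling frame
n <- 50                      # desired sample size
k <- N %/% n                 # sampling interval: every k-th unit (here 20)

start <- sample(1:k, 1)      # random starting point within the first interval

# Positions of the selected units: start, start + k, start + 2k, ...
sys_ids <- seq(from = start, by = k, length.out = n)

head(sys_ids)                # first few selected positions
```

One caution with this design: if the frame is ordered in a way that repeats with the same period as the interval, the sample can pick up that pattern.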
Non-Probability Sampling
Convenience Sampling:
Convenience sampling selects participants because they are convenient for the researcher to reach. Examples given of this were ease of access, geographical proximity, and existing contact with the population of interest. One important thing to know about this method is that these samples are sometimes referred to as accidental samples, since participants can be selected simply because they happen to be near the researcher.
Self-Selection Sampling:
Also known as volunteer sampling, self-selection sampling is when participants volunteer to be a part of the research sample. The most common example is when the sample must meet specific criteria, often in medical or psychological research. Volunteers usually sign up for these studies through ads or are recruited.
In particular, CPS is a sample survey of about 60,000 eligible households (in 2017) scientifically selected to reflect the entire U.S. civilian noninstitutional population. On the basis of responses to a series of questions on work and job search activities, each person 16 years and over in a sample household is classified as employed, unemployed, or not in the labor force.
The population of interest in the Current Population Survey is the entire United States civilian noninstitutional population. The age criterion is 16 years of age and over. For occupation, each person is classified as employed, unemployed, or not in the labor force. The survey covers the entire United States.
The number of households in the monthly sample was just over 65,000 per IPUMS. While we know the number of households, we do not know exactly how many people there are, since this differs per household. I found it interesting to see how they broke their findings into 10-year periods to capture current events. Splitting the data this way helps us separate out factors that could be causing unemployment. For example, World War II was under way in the 1940s, and its effect on unemployment is visible in the graph posted. Without looking at external economic events, we would have more questions than answers as to why certain time periods have such drastic spikes. The sampling method used was stratified, multistage sampling: the population is divided into subgroups (strata), and random samples are drawn from within each subgroup in more than one stage. We know that for this study they took these samples from different areas of the country as well.
Yes, I do. Since the survey used stratified sampling, we know that it does a good job of limiting bias to give the best sense of the entire population, in this case the entire US population. This methodology does so by sampling from different geographic areas as well as accounting for other economic factors.
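To make the idea of stratified sampling concrete, here is a rough base R sketch that draws the same fraction from each subgroup. It is a single-stage simplification, not the multistage design the CPS actually uses, and the frame, the region names, the stratum sizes, and the 10% fraction are all made up for illustration.

```r
set.seed(7)                            # fixed seed so the draw can be repeated

# Hypothetical frame: 600 units spread unevenly across four regions
frame <- data.frame(
  id     = 1:600,
  region = rep(c("Northeast", "Midwest", "South", "West"),
               times = c(100, 130, 230, 140))
)

frac <- 0.10                           # sample 10% of every stratum

# Split the frame by region, sample rows within each piece, then recombine
strata <- split(frame, frame$region)
strat_sample <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), size = round(nrow(s) * frac)), , drop = FALSE]
}))

table(strat_sample$region)             # units drawn per region
```

Because each region contributes in proportion to its size, no region can be left out of the sample by chance, which is the main way stratification limits bias.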
# Installed ipumsr
library(ipumsr)
ddi <- read_ipums_ddi("cps_00001.xml")
data <- read_ipums_micro(ddi)
## Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
summary(data)
## YEAR SERIAL MONTH HWTFINL
## Min. :2009 Min. : 1 Min. : 1.000 Min. : 0
## 1st Qu.:2017 1st Qu.:19513 1st Qu.: 3.000 1st Qu.: 1567
## Median :2021 Median :38877 Median : 3.000 Median : 3381
## Mean :2019 Mean :40153 Mean : 4.807 Mean : 3105
## 3rd Qu.:2022 3rd Qu.:58851 3rd Qu.: 7.000 3rd Qu.: 4310
## Max. :2024 Max. :99461 Max. :12.000 Max. :20133
## NA's :2777271
## CPSID ASECFLAG HFLAG ASECWTH
## Min. :0.000e+00 Min. :1 Min. :0 Min. : 53
## 1st Qu.:2.013e+13 1st Qu.:1 1st Qu.:0 1st Qu.: 871
## Median :2.020e+13 Median :1 Median :0 Median : 1623
## Mean :1.730e+13 Mean :1 Mean :0 Mean : 1785
## 3rd Qu.:2.022e+13 3rd Qu.:1 3rd Qu.:1 3rd Qu.: 2301
## Max. :2.024e+13 Max. :2 Max. :1 Max. :28654
## NA's :3713050 NA's :6594980 NA's :4017265
## PERNUM WTFINL CPSIDP CPSIDV
## Min. : 1.000 Min. : 0 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.: 1.000 1st Qu.: 1572 1st Qu.:2.013e+13 1st Qu.:2.013e+14
## Median : 2.000 Median : 3380 Median :2.020e+13 Median :2.020e+14
## Mean : 2.166 Mean : 3184 Mean :1.730e+13 Mean :1.730e+14
## 3rd Qu.: 3.000 3rd Qu.: 4401 3rd Qu.:2.022e+13 3rd Qu.:2.022e+14
## Max. :16.000 Max. :44748 Max. :2.024e+13 Max. :2.024e+14
## NA's :2777271
## ASECWT LABFORCE INCWAGE
## Min. : 50 Min. :0.000 Min. : 0
## 1st Qu.: 882 1st Qu.:1.000 1st Qu.: 0
## Median : 1638 Median :1.000 Median : 30000
## Mean : 1828 Mean :1.293 Mean :22444229
## 3rd Qu.: 2389 3rd Qu.:2.000 3rd Qu.: 129000
## Max. :44424 Max. :2.000 Max. :99999999
## NA's :4017265 NA's :4017265
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Drop the 99999999 placeholder code in INCWAGE (filter() also drops rows where INCWAGE is NA)
data2 <- data %>% filter(INCWAGE != 99999999)
library(psych)
describe(data2$INCWAGE)
## vars n mean sd median trimmed mad min max range
## X1 1 2154587 30310.41 56835.37 12000 30310.41 17791.2 0 2099999 2099999
## skew kurtosis se
## X1 8.15 121.31 38.72
describe(data2$LABFORCE)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 2154587 1.62 0.49 2 1.62 0 0 2 2 -0.62 -1.34 0
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
# Scatter plot of wage income against the labor force status code
ggplot(data = data2,
       mapping = aes(x = LABFORCE,
                     y = INCWAGE)) +
  geom_point()
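The filter on INCWAGE earlier matters because 99999999 is a placeholder code rather than a real wage (as far as I understand the IPUMS CPS codebook, it marks respondents who are not in the universe for the question). The toy example below uses a handful of made-up wage values, not the CPS data, to show how leaving such a code in place inflates the mean; the names `wages`, `niu_code`, and `clean` are hypothetical.

```r
# Made-up wage values; the last two entries are the placeholder code
wages    <- c(0, 12000, 30000, 45000, 99999999, 99999999)
niu_code <- 99999999

mean(wages)                       # inflated by the placeholder values

clean <- wages[wages != niu_code] # keep only real wage values
mean(clean)                       # mean of the remaining four values
```

This is the same pattern we see in the output above, where the mean of INCWAGE falls from about 22.4 million in `summary(data)` to about 30,310 once the code is filtered out.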