Sampling methods allow us to select a subset of individuals, or a sample, from a larger population so that we can make inferences about the entire population. There are two broad families of sampling methods: probability and non-probability. Probability sampling selects the sample at random, so that every individual in the population has a known, nonzero chance of selection (an equal chance, in the case of simple random sampling). This helps us avoid bias in our results more than non-probability sampling does, though not completely. Probability sampling includes simple random sampling, cluster sampling, systematic sampling, and stratified random sampling.
In non-probability sampling, selection is not random, so individuals do not have an equal (or even known) chance of being selected. We can still make generalizations about a population, but they are less reliable than those drawn from a probability sample. On the other hand, it tends to be more cost effective. Non-probability sampling includes convenience sampling, snowball sampling, and quota sampling.
Probability sampling is generally the better method: because the sample is chosen at random, it helps reduce bias, though not completely, and it helps make the sample more representative of the population. It can be expensive and time consuming, though, as it requires access to large amounts of data and methodologies that ensure the subjects are chosen at random. That said, the right choice depends on your research design and what you’re trying to achieve.
Systematic sampling is a type of probability sampling that draws a sample from the target population by selecting units at regular intervals, starting from a random point. This is a good method to use when there are existing records of our target population, like a company’s employment records or a list of attendees at a concert. To conduct this sampling method, we divide the population size by our desired sample size to get the sampling interval, which splits the list into equal segments. We select one unit from the first interval using simple random sampling, and then select the unit in the same position (e.g., the 13th) from each of the remaining intervals.
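As a rough illustration of these steps, here is a minimal sketch in base R. The sampling frame, the population size of 1,000, and the sample size of 50 are all made-up values chosen for the example, not anything taken from a real data set.

```r
# Hypothetical sampling frame: 1,000 attendee IDs (made-up data).
set.seed(42)                       # fixed seed so the sketch is reproducible
frame <- data.frame(id = 1:1000)

n <- 50                            # desired sample size (assumed)
k <- nrow(frame) %/% n             # sampling interval: 1000 / 50 = 20

# Pick a random starting position inside the first interval (1..k) ...
start <- sample(k, 1)

# ... then take the unit in that same position from every later interval.
positions <- seq(from = start, by = k, length.out = n)
systematic_sample <- frame[positions, , drop = FALSE]

nrow(systematic_sample)            # 50 units
unique(diff(positions))            # spacing between selected positions: 20
```

Only the starting point is random here. Once it is drawn, every other selected position follows from the interval.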
Cluster sampling is a type of probability sampling. Like stratified sampling, it divides the target population into groups, but here we randomly select a subset of those groups (clusters) to form our sample, rather than sampling from every group. It’s helpful when we want to study large, geographically dispersed populations where groups already exist in some way (e.g., classes in a school, student orgs, departments at a company).
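To make the idea concrete, the sketch below runs a one-stage cluster sample in base R. The population of 200 students in 20 classes and the choice of 4 clusters are hypothetical values used only for illustration.

```r
set.seed(7)                        # fixed seed so the sketch is reproducible

# Hypothetical population: 200 students spread evenly over 20 classes.
students <- data.frame(
  student_id = 1:200,
  class_id   = rep(1:20, each = 10)
)

# Randomly choose 4 of the 20 classes (the clusters).
chosen_classes <- sample(unique(students$class_id), size = 4)

# One-stage cluster sampling: keep every student in the chosen classes.
cluster_sample <- students[students$class_id %in% chosen_classes, ]

nrow(cluster_sample)               # 40 students (4 classes x 10 students)
table(cluster_sample$class_id)     # 10 students in each chosen class
```

The randomness is applied to the classes, not to individual students, which is what separates this from a simple random sample of 40 students.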
Convenience sampling is a type of non-probability sampling. It’s cost effective and quick. You choose individuals based on convenience and their availability to participate. For example, surveying people at your workplace about their thoughts on pay transparency.
Consecutive sampling is a type of non-probability sampling where you choose a person or small group, conduct your research with them, analyze the results, and gather more samples if necessary. For example, conducting a study at a public location, like a grocery store or mall, by distributing surveys to people walking by.
The population of interest here is the entire U.S. civilian non-institutional population aged 16 and over. It excludes active duty military and people confined to institutions such as jails or nursing homes. It does include citizens of foreign countries who reside in the U.S. but do not live in an embassy. It includes people classified as employed, unemployed, or not in the labor force.
The sample consists of about 60,000 eligible households (as of 2017), scientifically selected. The CPS is a monthly survey that asks this sample about work and job search activities. Each person aged 16 and over in a sample household is classified as employed, unemployed, or not in the labor force. The CPS uses a multi-stage stratified sample. The first stage involves dividing each U.S. state into primary sampling units, most of which consist of a city, a large county, or a group of smaller adjacent counties. Within each state, these units are grouped into strata that are homogeneous with respect to labor force and other socio-economic characteristics correlated with unemployment. One unit is sampled per stratum, with each unit’s probability of selection proportional to its population. In the second stage, a systematic sample of housing units is drawn from within each chosen unit. Ultimate sampling units are clusters of about 4 housing units. Occasionally, a third stage of sampling is necessary when an ultimate sampling unit is very large. This multi-stage, stratified sampling method is roughly equivalent to dividing the entire U.S. into ultimate sampling units and selecting a clustered sample of these.
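The first-stage rule (one primary sampling unit per stratum, chosen with probability proportional to population) can be sketched in base R as follows. The unit names, strata, and population counts are invented for illustration, and this is only a toy version of the idea, not the Census Bureau’s actual procedure.

```r
set.seed(2017)                     # fixed seed so the sketch is reproducible

# Hypothetical primary sampling units (PSUs); all values are made up.
psus <- data.frame(
  psu        = c("A", "B", "C", "D", "E", "F"),
  stratum    = c(1, 1, 1, 2, 2, 2),
  population = c(500000, 120000, 80000, 300000, 250000, 50000)
)

# Selection probability of each PSU within its stratum:
# its population divided by the stratum's total population.
psus$p_select <- psus$population /
  ave(psus$population, psus$stratum, FUN = sum)

# Draw exactly one PSU per stratum, weighting the draw by population.
pick_one_pps <- function(df) {
  idx <- sample.int(nrow(df), size = 1, prob = df$population)
  df[idx, ]
}
first_stage <- do.call(rbind, lapply(split(psus, psus$stratum), pick_one_pps))

psus                               # all PSUs with their selection probabilities
first_stage                        # the one PSU chosen in each stratum
```

Larger units are more likely to be drawn, but every unit in a stratum keeps a nonzero chance of selection, which is what keeps this a probability sample.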
The CPS contains a monthly sample of about 60,000 households. The number of people in the sample depends on how many people live in each household. What’s interesting about this methodology is the multi-stage approach to sampling. Due to the diverse makeup of the U.S., it makes sense that we’d need to collect data from different types of areas, thus dividing the units into cities, large counties, and aggregated groups of smaller counties. Additionally, grouping further based on socio-economic criteria makes sense so that one group (say, wealthier or poorer individuals) doesn’t bias the results. What’s also interesting is that households are rotated into the survey: a household is interviewed for 4 successive months, then not interviewed for 8 months, then returned to the sample for 4 more months. I imagine this helps reduce bias, since we would keep getting the same answers if we interviewed the same households indefinitely.
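The 4-8-4 rotation described above can be laid out for a single household with a short base R sketch. The column names are hypothetical, and months are counted from the household’s first interview.

```r
# Months-in-sample pattern for one household: 4 in, 8 out, 4 in (16 months).
in_sample <- rep(c(TRUE, FALSE, TRUE), times = c(4, 8, 4))

rotation <- data.frame(
  month_since_entry = 1:16,
  interviewed       = in_sample
)

sum(rotation$interviewed)                          # 8 interviews in total
rotation$month_since_entry[rotation$interviewed]   # months 1-4 and 13-16
```

One consequence of the pattern is that the second block of interviews (months 13 to 16) falls in the same calendar months as the first block, one year later.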
Yes, the CPS does appear to be a representative sample of the U.S. population, because its multi-stage, stratified, clustered design captures different geographic areas and socio-economic factors.
#install.packages("ipumsr")
library("ipumsr")
setwd("/Users/ginaocchipinti/Documents/ADEC 7310 Data Analytics/Week 4")
ddi <- read_ipums_ddi("cps_00001.xml")
data <- read_ipums_micro(ddi)
## Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
summary(data)
## YEAR SERIAL MONTH HWTFINL
## Min. :2009 Min. : 1 Min. : 1.000 Min. : 0
## 1st Qu.:2017 1st Qu.:19562 1st Qu.: 3.000 1st Qu.: 1575
## Median :2021 Median :38965 Median : 3.000 Median : 3404
## Mean :2019 Mean :40246 Mean : 4.876 Mean : 3120
## 3rd Qu.:2022 3rd Qu.:58959 3rd Qu.: 7.000 3rd Qu.: 4330
## Max. :2024 Max. :99461 Max. :12.000 Max. :20133
## NA's :2777271
## CPSID ASECFLAG HFLAG ASECWTH
## Min. :0.000e+00 Min. :1 Min. :0 Min. : 53
## 1st Qu.:2.013e+13 1st Qu.:1 1st Qu.:0 1st Qu.: 871
## Median :2.021e+13 Median :1 Median :0 Median : 1623
## Mean :1.724e+13 Mean :1 Mean :0 Mean : 1785
## 3rd Qu.:2.022e+13 3rd Qu.:1 3rd Qu.:1 3rd Qu.: 2301
## Max. :2.024e+13 Max. :2 Max. :1 Max. :28654
## NA's :3592239 NA's :6474169 NA's :3896454
## PERNUM WTFINL CPSIDP CPSIDV
## Min. : 1.000 Min. : 0 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.: 1.000 1st Qu.: 1581 1st Qu.:2.013e+13 1st Qu.:2.013e+14
## Median : 2.000 Median : 3402 Median :2.021e+13 Median :2.021e+14
## Mean : 2.167 Mean : 3200 Mean :1.724e+13 Mean :1.724e+14
## 3rd Qu.: 3.000 3rd Qu.: 4422 3rd Qu.:2.022e+13 3rd Qu.:2.022e+14
## Max. :16.000 Max. :44748 Max. :2.024e+13 Max. :2.024e+14
## NA's :2777271
## ASECWT LABFORCE INCWAGE
## Min. : 50 Min. :0.000 Min. : 0
## 1st Qu.: 882 1st Qu.:1.000 1st Qu.: 0
## Median : 1638 Median :1.000 Median : 30000
## Mean : 1828 Mean :1.292 Mean :22444229
## 3rd Qu.: 2389 3rd Qu.:2.000 3rd Qu.: 129000
## Max. :44424 Max. :2.000 Max. :99999999
## NA's :3896454 NA's :3896454
It does appear that labor force participation varies by income and wage. As people earn more money, their labor force participation rate is higher. This makes sense, since earning a higher income or wage generally requires working.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
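# Drop rows where INCWAGE holds the 99999999 placeholder code rather than a real wage
data1 <- data %>% filter(INCWAGE != 99999999)
# Scatter plot of wage income against labor force status
ggplot(data = data1,
       mapping = aes(x = LABFORCE,
                     y = INCWAGE)) +
  geom_point()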