I. SAMPLING METHODOLOGIES

In your own words, after your online readings, describe what is probability sampling and non-probability sampling (max 5 lines each). You can even talk about the differences between the sampling methodologies.

Sampling methods allow us to select a subset of individuals, or a sample, from a larger population so that we can make inferences about the entire population. Probability and non-probability sampling are the two broad approaches. Probability sampling selects the sample at random, so that every individual in the population has a known, non-zero chance of selection (an equal chance, in the case of simple random sampling). This reduces selection bias more than non-probability sampling does, though it does not eliminate bias completely. Probability sampling includes simple random sampling, cluster sampling, systematic sampling, and stratified random sampling.

In non-probability sampling, individuals are chosen by non-random criteria such as availability or the researcher’s judgment, so not every individual has a chance of being selected. Generalizations about the population drawn from such a sample are therefore less reliable than those from a probability sample. It does tend to be more cost effective, though. Non-probability sampling includes convenience sampling, snowball sampling, and quota sampling.

Which probability sampling method should we use?

Probability sampling is the better method because selecting the sample at random reduces bias, though not completely, and makes the sample more likely to be representative of the population. It can be expensive and time consuming, though, as it requires access to a complete list of the population and a procedure to ensure the subjects are chosen at random. That said, the right choice depends on your research design and what you’re trying to achieve.

Also describe at least 2 survey designs from each of the two categorizations above.

Systematic sampling is a type of probability sampling that draws a random sample from the target population by selecting units at regular intervals, starting from a random point. This is a good method to use when there are existing records of our target population, like a company’s employment records or a list of attendees at a concert. To conduct this sampling method, we compute a sampling interval by dividing the population size by our desired sample size (for example, 1,000 records and a sample of 50 give an interval of 20). We select one unit from the first interval using simple random sampling, and then select the unit in the same position within each later interval (e.g., if the 13th unit is drawn first, we then take the 33rd, 53rd, and so on).
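The systematic procedure described above can be sketched in a few lines of R. This is only an illustration: the attendee list, the population size of 1,000, and the sample size of 50 are made-up values, not data from the assignment.

```r
# Systematic sampling sketch on a hypothetical list of 1,000 attendees.
set.seed(42)                                    # only so the example is reproducible
population <- sprintf("attendee_%04d", 1:1000)  # hypothetical list of records
n <- 50                                         # desired sample size (assumed)
k <- length(population) %/% n                   # sampling interval: 1000 / 50 = 20
start <- sample.int(k, 1)                       # random position within the first interval
idx <- seq(from = start, by = k, length.out = n)  # same position in every interval
systematic_sample <- population[idx]
head(systematic_sample)
```

Only the starting position is random; every later pick is fixed by the interval, which is why the method works best when the list has no repeating pattern that lines up with the interval.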

Cluster sampling is a type of probability sampling in which we divide our target population into groups, or clusters, and then randomly select a subset of these clusters to form our sample. It resembles stratified sampling in that the population is first divided into groups, but here whole clusters are selected rather than individuals from every group. It’s helpful when we want to study large, geographically dispersed populations where groups already exist in some way (e.g., classes in a school, student orgs, departments at a company).
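As a rough sketch of cluster sampling, the R snippet below picks whole classes at random and keeps every student in them. The data frame, the 40 classes of 25 students, and the choice of 5 clusters are all hypothetical values chosen for illustration.

```r
# Cluster sampling sketch: sample whole classes, then keep everyone in them.
set.seed(7)                                # only so the example is reproducible
students <- data.frame(                    # hypothetical population
  student_id = 1:1000,
  class_id   = rep(1:40, each = 25)        # 40 classes (clusters) of 25 students
)
chosen_classes <- sample(unique(students$class_id), size = 5)  # draw 5 clusters at random
cluster_sample <- students[students$class_id %in% chosen_classes, ]
nrow(cluster_sample)                       # 5 classes x 25 students = 125
```

The random draw happens at the cluster level, so the sample size is determined by how many clusters are chosen and how large they are.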

Convenience sampling is a type of non-probability sampling. It’s cost effective and quick. You choose individuals based on convenience and their availability to participate. For example, surveying people at your workplace about their thoughts on pay transparency.

Consecutive sampling is a type of non-probability sampling in which you choose a person or small group, conduct the research with them, analyze the results, and gather more samples if necessary. For example, conducting studies at public locations, like a grocery store or mall, by distributing surveys to people walking by.

II. The Current Population Survey (CPS) is a monthly survey of households conducted by the Bureau of Census for the Bureau of Labor Statistics. It provides a comprehensive body of data on the labor force, employment, unemployment, persons not in the labor force, hours of work, earnings, and other demographic and labor force characteristics.

In particular, CPS is a sample survey of about 60,000 eligible households (in 2017) scientifically selected to reflect the entire U.S. civilian noninstitutional population. On the basis of responses to a series of questions on work and job search activities, each person 16 years and over in a sample household is classified as employed, unemployed, or not in the labor force.

What is the population of interest in the Current Population Survey (CPS)? Be precise – are there any age criteria, occupation criteria, and/or geographic criteria?

The population of interest is the entire U.S. civilian non-institutional population 16 years of age and over. It excludes active duty military personnel and people confined to institutions such as jails or nursing homes. It does include citizens of foreign countries who reside in the US but do not live in an embassy. There is no occupation criterion: it includes people classified as employed, unemployed, or not in the labor force.

What is the sample used to estimate the population parameters/headline statistics like unemployment rate, employment to population ratio?

The sample is about 60,000 eligible households (in 2017), scientifically selected. The CPS surveys this sample monthly, asking questions about work and job search activities. Each person 16 years and over in a sample household is classified as employed, unemployed, or not in the labor force. The CPS uses a multi-stage stratified sample. The first stage involves dividing each U.S. state into primary sampling units, most of which consist of a city, a large county, or a group of smaller adjacent counties. Within each state, these units are grouped into strata that are homogeneous with respect to labor force and other socio-economic characteristics correlated with unemployment. One unit is sampled per stratum, where the probability of selection for each unit is proportional to its population. In the second stage, a systematic sample of housing units is drawn from within each chosen unit. Ultimate sampling units are clusters of about 4 housing units. Occasionally, a third stage of sampling is necessary when the actual ultimate unit size is very large. This multi-stage, stratified sampling method is roughly equivalent to dividing the entire US into ultimate units and selecting a clustered sample of these.
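The two stages described above can be illustrated with a toy example in R. This is a simplified sketch, not the Census Bureau's actual procedure: the twelve units, three strata, population figures, housing unit counts, and the sample of 10 housing units per chosen unit are all invented for illustration.

```r
# Toy two-stage design: one primary sampling unit (PSU) per stratum with
# probability proportional to population, then a systematic sample within it.
set.seed(1)                                   # only so the example is reproducible
psus <- data.frame(                           # hypothetical PSUs and sizes
  psu           = paste0("PSU_", 1:12),
  stratum       = rep(c("A", "B", "C"), each = 4),
  population    = c(500, 1500, 1000, 2000, 800, 1200, 400, 1600, 3000, 1000, 500, 500),
  housing_units = c(200, 600, 400, 800, 320, 480, 160, 640, 1200, 400, 200, 200)
)

# Stage 1: within each stratum, draw one PSU with probability proportional to population.
pick_one <- function(df) df[sample.int(nrow(df), 1, prob = df$population), ]
stage1 <- do.call(rbind, lapply(split(psus, psus$stratum), pick_one))

# Stage 2: systematic sample of housing unit positions within each chosen PSU.
systematic_ids <- function(n_units, n_sample) {
  k <- n_units %/% n_sample                   # sampling interval
  seq(from = sample.int(k, 1), by = k, length.out = n_sample)
}
stage2 <- lapply(stage1$housing_units, systematic_ids, n_sample = 10)
names(stage2) <- stage1$psu
stage1[, c("stratum", "psu")]
```

Larger units are more likely to be chosen in the first stage, which is what "proportional to its population" means in practice.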

Be precise - how many households and/or people are included? Is there anything interesting/unique about the survey methodology that you found?

The CPS contains a monthly sample of about 60,000 households. The number of people depends on how many people live in each household. What’s interesting about this methodology is the multi-stage approach to sampling. Due to the diverse makeup of the US, it makes sense that we’d need to collect data from different types of areas, thus dividing the units into cities, large counties, and aggregated groups of smaller counties. Additionally, grouping further based on socio-economic criteria makes sense so that one group (say, wealthier or poorer individuals) doesn’t bias the results. What’s also interesting is that households are rotated into the survey, so a household is interviewed for 4 successive months, then not interviewed for 8 months, then returned to the sample for 4 more months. I imagine this helps reduce bias, since we would keep getting the same answers if we interviewed the same households indefinitely.
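The 4 months in, 8 months out, 4 months in rotation can be written as a small R helper. The function name and the convention that month 0 is a household's first interview are assumptions made for this sketch.

```r
# Rotation sketch: is a household interviewed in a given month?
# months_since_entry = 0 is the household's first interview month (assumed convention).
in_sample <- function(months_since_entry) {
  months_since_entry %in% c(0:3, 12:15)   # 4 months in, 8 months out, 4 months in
}

schedule <- in_sample(0:15)               # the 16 months after entering the survey
sum(schedule)                             # total number of interviews per household
which(schedule) - 1                       # months (since entry) with an interview
```

Under this pattern a household interviewed in one of its first four months is interviewed again exactly 12 months later.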

Do you think CPS is a representative sample of the entire US population after reading about its methodology or your online research?

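Yes, the CPS does appear to be a representative sample of its target population, the U.S. civilian non-institutional population, due to its multi-stage, stratified, clustered approach, which captures different geographic areas and socio-economic factors. It is not designed to represent groups outside that target, such as active duty military or people living in institutions.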

Plot / summarize income wage variable by labor force status. Do you find any patterns in labor force statistics that make sense, such as income varying by labor force status?

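# Install ipumsr once if needed, then load it to read the IPUMS extract
#install.packages("ipumsr")
library("ipumsr")
# Point R at the folder that holds the extract files
setwd("/Users/ginaocchipinti/Documents/ADEC 7310  Data Analytics/Week 4")
# Read the DDI codebook (metadata), then the microdata it describes
ddi <- read_ipums_ddi("cps_00001.xml")
data <- read_ipums_micro(ddi)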
## Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
summary(data)
##       YEAR          SERIAL          MONTH           HWTFINL       
##  Min.   :2009   Min.   :    1   Min.   : 1.000   Min.   :    0    
##  1st Qu.:2017   1st Qu.:19562   1st Qu.: 3.000   1st Qu.: 1575    
##  Median :2021   Median :38965   Median : 3.000   Median : 3404    
##  Mean   :2019   Mean   :40246   Mean   : 4.876   Mean   : 3120    
##  3rd Qu.:2022   3rd Qu.:58959   3rd Qu.: 7.000   3rd Qu.: 4330    
##  Max.   :2024   Max.   :99461   Max.   :12.000   Max.   :20133    
##                                                  NA's   :2777271  
##      CPSID              ASECFLAG           HFLAG            ASECWTH       
##  Min.   :0.000e+00   Min.   :1         Min.   :0         Min.   :   53    
##  1st Qu.:2.013e+13   1st Qu.:1         1st Qu.:0         1st Qu.:  871    
##  Median :2.021e+13   Median :1         Median :0         Median : 1623    
##  Mean   :1.724e+13   Mean   :1         Mean   :0         Mean   : 1785    
##  3rd Qu.:2.022e+13   3rd Qu.:1         3rd Qu.:1         3rd Qu.: 2301    
##  Max.   :2.024e+13   Max.   :2         Max.   :1         Max.   :28654    
##                      NA's   :3592239   NA's   :6474169   NA's   :3896454  
##      PERNUM           WTFINL            CPSIDP              CPSIDV         
##  Min.   : 1.000   Min.   :    0     Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.: 1.000   1st Qu.: 1581     1st Qu.:2.013e+13   1st Qu.:2.013e+14  
##  Median : 2.000   Median : 3402     Median :2.021e+13   Median :2.021e+14  
##  Mean   : 2.167   Mean   : 3200     Mean   :1.724e+13   Mean   :1.724e+14  
##  3rd Qu.: 3.000   3rd Qu.: 4422     3rd Qu.:2.022e+13   3rd Qu.:2.022e+14  
##  Max.   :16.000   Max.   :44748     Max.   :2.024e+13   Max.   :2.024e+14  
##                   NA's   :2777271                                          
##      ASECWT           LABFORCE        INCWAGE        
##  Min.   :   50     Min.   :0.000   Min.   :       0  
##  1st Qu.:  882     1st Qu.:1.000   1st Qu.:       0  
##  Median : 1638     Median :1.000   Median :   30000  
##  Mean   : 1828     Mean   :1.292   Mean   :22444229  
##  3rd Qu.: 2389     3rd Qu.:2.000   3rd Qu.:  129000  
##  Max.   :44424     Max.   :2.000   Max.   :99999999  
##  NA's   :3896454                   NA's   :3896454

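It does appear that income/wage varies by labor force status. People in the labor force report wage income, while those outside it report little or none. This makes sense, as wages generally require working to obtain them. Note that the INCWAGE mean in the summary above (22,444,229) is inflated by the placeholder value 99999999, which IPUMS uses for records that are not in universe, so that value has to be removed before comparing groups.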

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
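# Drop the 99999999 placeholder code; filter() also drops rows where INCWAGE is NA
data1 <- data %>% filter(INCWAGE != 99999999)
# Plot wage income against labor force status
ggplot(data = data1,
       mapping = aes(x = LABFORCE, y = INCWAGE)) +
  geom_point()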
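
A numeric summary by group can complement the scatter plot. The sketch below uses a small made-up data frame in place of the real extract, and base R so it runs without extra packages. The six rows and their weights are invented, and the reading of the LABFORCE codes (1 = not in the labor force, 2 = in the labor force) is my understanding of the IPUMS coding and should be checked against the codebook.

```r
# Weighted mean of INCWAGE by LABFORCE on a toy stand-in for the extract.
toy <- data.frame(
  LABFORCE = c(1, 1, 2, 2, 2, 0),          # assumed codes: 1 = not in LF, 2 = in LF
  INCWAGE  = c(0, 0, 30000, 50000, 99999999, 99999999),
  ASECWT   = c(1000, 2000, 1500, 500, 1200, 800)   # made-up person weights
)

# Remove the 99999999 placeholder before summarizing
toy_clean <- toy[toy$INCWAGE != 99999999, ]

# Weighted mean wage income within each labor force status
by_status <- sapply(split(toy_clean, toy_clean$LABFORCE),
                    function(d) weighted.mean(d$INCWAGE, d$ASECWT))
by_status
```

Using the survey weight matters because each sampled person stands in for a different number of people in the population; an unweighted mean would treat every row equally.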

OPTIONAL: Does income/wage vary by region such as state?