I. SAMPLING METHODOLOGIES

1. Probability and Non-probability sampling

a. Probability Sampling:

Probability sampling is a method of selecting a sample from a larger population in such a way that each element in the population has a known and non-zero chance of being included in the sample. It relies on random selection techniques, like simple random sampling, stratified sampling, or systematic sampling, to ensure that the sample is representative of the population, making it suitable for statistical inference.
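As a minimal sketch of two of the techniques named above, the following base R code draws a simple random sample and a systematic sample from a toy sampling frame. The frame, the sample size, and all object names are hypothetical and chosen only for illustration; they are not taken from any survey discussed in this report.

```r
# Toy sampling frame of 1,000 hypothetical unit IDs (illustrative only)
set.seed(42)
population <- data.frame(id = 1:1000)
n <- 50

# Simple random sampling: n units drawn without replacement,
# so every unit has the same known inclusion probability n / N
srs_ids <- sample(population$id, size = n)

# Systematic sampling: pick a random start, then take every k-th unit
k <- nrow(population) / n            # sampling interval (20 here)
start <- sample(seq_len(k), 1)       # random start between 1 and k
sys_ids <- seq(from = start, by = k, length.out = n)

# Known, non-zero inclusion probability under both designs
inclusion_prob <- n / nrow(population)
```

In both designs the inclusion probability is known in advance (0.05 here), which is what makes design-based inference possible.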

b. Non-Probability Sampling:

Non-probability sampling is a method of selecting a sample from a population in which the selection probability of each element is unknown, and some elements may have no chance of being included at all. Instead, sample elements are chosen based on convenience, judgment, or subjective criteria, leading to a potential lack of representativeness. Common non-probability sampling methods include convenience sampling, purposive sampling, and snowball sampling. Non-probability sampling is often used when probability sampling is impractical or too costly, but it may introduce bias into research results.

Probability sampling relies on random selection, so every member of the population has a known, non-zero chance of being chosen (an equal chance in the case of simple random sampling). This reduces sampling bias and supports inference about the whole population, including its varied subgroups. Non-probability sampling, on the other hand, relies on the researcher's subjective judgment or on convenience and does not give every member a known chance of selection. It makes it easier to reach a specific audience and can work reasonably well when the units of interest share similar characteristics, but the sample may not accurately represent the entire population and can introduce bias.
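The contrast can be illustrated with a small simulation. This is a sketch under invented assumptions: a hypothetical population in which an "easy to reach" group has higher income than a "hard to reach" group, with made-up group sizes and income values.

```r
set.seed(1)

# Hypothetical population: two groups with different income levels
pop <- data.frame(
  group  = rep(c("easy_to_reach", "hard_to_reach"), times = c(3000, 7000)),
  income = c(rnorm(3000, mean = 60000, sd = 5000),
             rnorm(7000, mean = 40000, sd = 5000))
)
true_mean <- mean(pop$income)

# Probability sample: 500 units drawn at random from the whole population
prob_sample <- pop[sample(nrow(pop), 500), ]

# Convenience sample: 500 units drawn only from the easy-to-reach group
conv_pool   <- pop[pop$group == "easy_to_reach", ]
conv_sample <- conv_pool[sample(nrow(conv_pool), 500), ]

# Absolute error of each sample mean relative to the population mean
prob_error <- abs(mean(prob_sample$income) - true_mean)
conv_error <- abs(mean(conv_sample$income) - true_mean)
```

Under these assumptions the convenience sample misses the hard-to-reach group entirely, so `conv_error` comes out far larger than `prob_error`, even though both samples have the same size.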

2. Stratified Random Sampling and Purposive sampling

a. Stratified Random Sampling:

Stratified random sampling is a probability sampling method where the population is divided into distinct subgroups (strata) based on certain characteristics. Samples are then randomly selected from each stratum to ensure representation of all subgroups in the final sample. This method is useful for ensuring a more accurate representation of different subpopulations within the larger population.
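A minimal sketch of stratified random sampling with proportional allocation, in base R. The frame, the `region` stratifying variable, and the 10% sampling fraction are hypothetical choices made for illustration.

```r
set.seed(7)

# Hypothetical sampling frame with three strata of unequal size
frame <- data.frame(
  id     = 1:1000,
  region = rep(c("North", "South", "West"), times = c(500, 300, 200))
)
sampling_fraction <- 0.10

# Split the frame by stratum, draw a simple random sample within each
# stratum, then recombine the pieces into one sample
strata <- split(frame, frame$region)
stratified_sample <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), size = round(nrow(s) * sampling_fraction)), ]
}))

# Sample counts per stratum mirror the population shares (50 / 30 / 20)
table(stratified_sample$region)
```

Because the same fraction is drawn from every stratum, each subgroup is guaranteed to appear in the sample in proportion to its size. Other allocations, such as oversampling small strata, would change only the `size` argument.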

b. Purposive Sampling:

Purposive sampling, a non-probability sampling technique, involves the deliberate selection of specific individuals or groups from the population based on the researcher’s judgment or specific criteria. It’s used when researchers want to target a particular subset of the population that is most relevant to their research objectives. Purposive sampling is subject to bias but is efficient for targeted research.

remove(list=ls())

II. THE CURRENT POPULATION SURVEY (CPS)

  1. The population of interest in the Current Population Survey (CPS) is the civilian, non-institutionalized population of the United States. The only occupation-related restriction is that active-duty members of the Armed Forces are excluded. The CPS is conducted on a national level, covering all 50 states and the District of Columbia, and it includes people of all ages who are not residing in institutions such as correctional facilities or nursing homes, although the official labor force statistics cover persons aged 16 and older. It is designed to provide demographic and labor force information about the U.S. population, making it one of the primary sources of labor force statistics in the country.

    Age criteria: Youth (16 to 24) or Older (55 or more)

  2. The Current Population Survey (CPS) is typically conducted with a sample size of approximately 60,000 households, comprising around 150,000 individuals. This sample size is used to estimate population parameters and headline statistics, including the unemployment rate, employment-to-population ratio, and other labor force characteristics for the entire civilian, non-institutionalized population of the United States.

    One interesting aspect of the survey methodology is its rotating panel design. The CPS divides the sample into eight rotation groups. Each household is interviewed for four consecutive months, rotated out of the survey for eight months, and then interviewed again for four more months (a 4-8-4 pattern). This design allows for the collection of data on changes in labor force status and other characteristics over time, which is crucial for understanding employment dynamics and trends in the U.S. labor market. Note that the CPS (the household survey) is distinct from the establishment (payroll) survey, although both produce sample-based estimates of employment.

    The CPS is a monthly survey, and it relies on in-person and telephone interviews to gather data. The rotating panel design, combined with the large sample size, helps provide accurate and comprehensive labor force statistics for the U.S. population. This survey methodology enables researchers to track labor market trends and conduct detailed analyses of employment and unemployment patterns.

  3. The Current Population Survey (CPS) aims to be a good reflection of the American population, but it may not be a perfect match. The sample is large and selected with a probability design, but some selected households do not respond, and certain groups could be underrepresented as a result. So, while it provides valuable information, researchers should be cautious about assuming it perfectly represents every aspect of the entire U.S. population.
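The rotating panel design described in item 2 can be sketched in a few lines of R. Under the CPS 4-8-4 pattern, a household is interviewed for four months, is out of the sample for eight months, and then returns for four more months. The function names below are hypothetical helpers, and months are represented as simple integer offsets rather than calendar dates.

```r
# Months (as integer offsets) in which a household that enters the
# survey in `entry_month` is interviewed under the 4-8-4 pattern
rotation_schedule <- function(entry_month) {
  first_spell  <- entry_month + 0:3     # interviews 1 to 4
  second_spell <- entry_month + 12:15   # interviews 5 to 8, after 8 months out
  c(first_spell, second_spell)
}

# Entry months of the eight rotation groups interviewed in a given month
in_sample <- function(month) c(month - 0:3, month - 12:15)

# Share of rotation groups common to two survey months
overlap <- function(m1, m2) {
  length(intersect(in_sample(m1), in_sample(m2))) / 8
}

rotation_schedule(1)   # 1 2 3 4 13 14 15 16
overlap(20, 21)        # 0.75: consecutive months share 6 of 8 groups
overlap(20, 32)        # 0.5: months one year apart share 4 of 8 groups
```

The large overlap between consecutive months, and between the same month in adjacent years, is what makes the design well suited to measuring month-to-month and year-over-year change.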

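# Stop early with an installation hint if ipumsr is missing, then attach it
if (!requireNamespace("ipumsr", quietly = TRUE)) stop("Reading IPUMS data into R requires the ipumsr package. It can be installed using the following command: install.packages('ipumsr')")
library(ipumsr)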

ddi <- read_ipums_ddi("cps_00001.xml")
data <- read_ipums_micro(ddi)
## Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
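# Recode the sentinel value 99999999 in INCWAGE to NA
# (assumed here to be the not-in-universe code; confirm against the IPUMS codebook)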
data$INCWAGE <- ifelse(data$INCWAGE == 99999999, NA, data$INCWAGE)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data %>%
  group_by(LABFORCE) %>%
  summarize(
    Mean_Income = mean(INCWAGE, na.rm = TRUE),
    Median_Income = median(INCWAGE, na.rm = TRUE),
    Min_Income = min(INCWAGE, na.rm = TRUE),
    Max_Income = max(INCWAGE, na.rm = TRUE)
  )
## # A tibble: 3 × 5
##   LABFORCE                       Mean_Income Median_Income Min_Income Max_Income
##   <int+lbl>                            <dbl>         <dbl>      <dbl>      <dbl>
## 1 0 [NIU]                             59852.         54650          0    1099999
## 2 1 [No, not in the labor force]       2388.             0          0    1099999
## 3 2 [Yes, in the labor force]         56369.         40000          0    2099999
# Create a box plot to visualize the distribution of income by labor force status.
# LABFORCE is a labelled integer, so convert it to a factor to get one box per category.
ggplot(data, aes(x = haven::as_factor(LABFORCE), y = INCWAGE)) +
  geom_boxplot() +
  labs(x = "Labor Force Status", y = "Income Wage") +
  ggtitle("Income Wage by Labor Force Status")
## Warning: Removed 3565332 rows containing non-finite values (`stat_boxplot()`).

# Plot individual incomes per category, again treating LABFORCE as a factor
ggplot(data, aes(x = haven::as_factor(LABFORCE), y = INCWAGE)) +
  geom_point() +
  labs(x = "Labor Force Status", y = "Income Wage") +
  ggtitle("Scatter Plot of Income Wage by Labor Force Status")
## Warning: Removed 3565332 rows containing missing values (`geom_point()`).

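People in the labor force tend to report higher wage income than those outside it: the median is 40,000 for labor force participants versus 0 for people not in the labor force, and the means are about 56,369 and 2,388 respectively. A high maximum also appears among those not in the labor force, so the pattern does not hold for every individual.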