Part 1 Sampling Methods

Characteristics
Probability sampling
  1. A sample is selected at random from the population.
  2. Each unit in the population has a known, nonzero chance of being selected (an equal chance under simple random sampling).
  3. Common probability sampling methods include:
    1. Simple random sampling
    2. Systematic sampling
    3. Stratified sampling
    4. Cluster sampling
Non-probability sampling
  1. Sample selection is based on the researcher’s preference or knowledge, as well as on the availability of the data.
  2. Units do not have a known or equal chance of being selected.
  3. This kind of sampling is normally used when population parameters are not known.
  4. This category of sampling carries a higher risk of research bias than probability sampling.
  5. Common non-probability sampling methods include:
    1. Convenience sampling
    2. Snowball sampling
    3. Quota sampling

Which sampling method should we use?

  • In an ideal situation where the population parameters are all known, we should use probability sampling to avoid sampling bias, because each unit in the population has a known chance of selection. If not all of the desired parameters are known, we have to fall back on non-probability sampling. When we do, we should bear in mind that it can introduce substantial bias into our data and thus hinder our ability to gain insight into reality.

Probability Sampling:

  • Simple random sampling: A simple random sample is a subset of a population in which each unit has an equal chance of being selected. It is commonly drawn by lottery or by random number draws.
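
As a minimal sketch of a simple random draw, the base R function `sample()` can pick units without replacement. The population of 1,000 unit IDs and the sample size of 50 are made-up values for illustration only.

```r
# Hypothetical population of 1,000 unit IDs (illustrative only)
set.seed(42)                      # fix the seed so the draw is reproducible
population <- 1:1000

# Draw 50 units without replacement; every unit has the same chance (50/1000)
srs <- sample(population, size = 50, replace = FALSE)

length(srs)          # number of units drawn
anyDuplicated(srs)   # 0 means no unit was drawn twice
```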

  • Stratified sampling: The population is divided into two or more subgroups (strata) that together make up the population. The researchers then randomly draw a certain percentage of the units from each stratum. If the same percentage is drawn from every stratum, so that each stratum’s share of the sample matches its share of the population, the method is called proportional stratified sampling. If the percentages differ across strata, it is called disproportional stratified sampling.
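
The proportional case can be sketched in base R as follows. The population, the stratum labels "A", "B", "C", and the helper `draw_stratified()` are all hypothetical and exist only for this illustration.

```r
set.seed(1)

# Hypothetical population: 600 units in stratum "A", 300 in "B", 100 in "C"
pop <- data.frame(
  id      = 1:1000,
  stratum = rep(c("A", "B", "C"), times = c(600, 300, 100))
)

# Illustrative helper: draw the same fraction at random from every stratum
draw_stratified <- function(data, frac) {
  parts  <- split(data, data$stratum)
  picked <- lapply(parts, function(s) {
    s[sample(nrow(s), size = round(frac * nrow(s))), , drop = FALSE]
  })
  do.call(rbind, picked)
}

strat_sample <- draw_stratified(pop, frac = 0.10)
table(strat_sample$stratum)   # 60 from A, 30 from B, 10 from C
```

Passing a different fraction for each stratum instead of a single `frac` would turn this into a disproportional design.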

Non-probability Sampling:

  • Snowball sampling: New units are recruited through units already in the sample, for example by asking existing participants to refer others. This can be an effective way to study people with specific traits who might otherwise be difficult to identify.

  • Quota sampling: Sample selection is based on a predetermined number of units. First, the population is divided into strata (subgroups) that reflect the characteristics the researchers are interested in. Then units are selected until the quota set by the researchers for each subgroup is filled. It is an effective way to give researchers control over what or who makes up the sample.
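
To contrast quota sampling with the random draws above, here is a small sketch in base R. The respondents, the two groups, and the quota sizes are invented; units are accepted in the order they become available, so no random selection is involved.

```r
# Hypothetical respondents, listed in the order they became available
respondents <- data.frame(
  id    = 1:12,
  group = c("young", "old", "young", "young", "old", "young",
            "old", "young", "old", "old", "young", "old")
)

# Quotas chosen by the researcher (assumed values)
quota <- c(young = 3, old = 2)

# Position of each respondent within their own group, in arrival order
rank_in_group <- ave(respondents$id, respondents$group, FUN = seq_along)

# Accept respondents until each group's quota is filled
keep <- rank_in_group <= quota[as.character(respondents$group)]
quota_sample <- respondents[keep, ]

quota_sample$id   # 1 2 3 4 5
```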

Part 2 The Current Population Survey (CPS)

What is the population of interest in the Current Population Survey (CPS)?

  • The CPS was designed to measure unemployment, so its population of interest is the civilian noninstitutional population of the United States, with the headline labor force statistics reported for people aged 16 and over. It is administered monthly by the U.S. Bureau of the Census to over 65,000 households and gathers information on education, labor force status, demographics, and other aspects of the U.S. population.

What sample is used to estimate the population parameters/headline statistics, such as the unemployment rate and the employment-to-population ratio?

According to the IPUMS-CPS website, “IPUMS-CPS samples are weighted.” Given this and the layout of the data categories, I think IPUMS-CPS uses a stratified sampling method, which lets the survey designers divide the population into subgroups by characteristic and then select observations for the sample according to the weight assigned to each subgroup.

Do you think the CPS is a representative sample of the entire U.S. population after reading about its methodology or doing your own online research?

As the IPUMS-CPS website states, “The IPUMS-CPS samples are weighted, with some records representing more cases than others. This means that persons and households with some characteristics are over-represented in the samples, while others are underrepresented.” Thus, in certain areas of the unweighted data, researchers might not capture the population accurately, because some groups are over-represented or under-represented in the samples provided by IPUMS-CPS.
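
A small sketch of why the weights matter: when records stand for different numbers of people, a weighted and an unweighted estimate can differ. The data below are invented, and the column name `wt` is a placeholder rather than an actual IPUMS-CPS variable.

```r
# Toy sample in which the last two records each represent more people
toy <- data.frame(
  employed = c(1, 1, 1, 0, 0, 1),
  wt       = c(100, 100, 100, 100, 400, 400)   # hypothetical person weights
)

unweighted <- mean(toy$employed)                        # 4/6
weighted   <- weighted.mean(toy$employed, w = toy$wt)   # 700/1200

c(unweighted = unweighted, weighted = weighted)
```

In this toy case the unweighted employment share is about 0.67, while the weighted share is about 0.58.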

Import IPUMS CPS data

# NOTE: To load data, you must download both the extract's data and the DDI
# and also set the working directory to the folder with these files (or change the path below).

if (!require("ipumsr")) stop("Reading IPUMS data into R requires the ipumsr package. It can be installed using the following command: install.packages('ipumsr')")
## Loading required package: ipumsr
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(dplyr)
library(ggplot2)  # needed for the ggplot() call in the plotting step

ddi <- read_ipums_ddi("/Users/pin.lyu/Desktop/cps_00001.xml")
data <- read_ipums_micro(ddi)
## Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.

Plot the data

# Keep only the variables needed
df <- subset(data, select = c(YEAR, LABFORCE, INCWAGE)) |>
  as.data.frame()

# Basic data summary
stargazer(df,type = "text", title = "Table 1: Summary Statistics of IPUMS Labor/Income Data")
## 
## Table 1: Summary Statistics of IPUMS Labor/Income Data
## ==================================================================
## Statistic     N          Mean         St. Dev.     Min     Max    
## ------------------------------------------------------------------
## YEAR      4,198,268   2,021.700        1.019      2,019   2,023   
## LABFORCE  4,198,268     1.306          0.761        0       2     
## INCWAGE    800,468  20,957,460.000 40,665,920.000   0   99,999,999
## ------------------------------------------------------------------
# Remove "0" in variable "LABFORCE"
# 0 = "NIU", which stands for "Not In Universe": the question was not asked
# of the person surveyed, so we drop these entries.

df <- df[df$LABFORCE != 0, ]

# Remove 99999999 ("Not In Universe") in variable "INCWAGE".
# Missing INCWAGE values are dropped explicitly: indexing with an NA condition
# would otherwise return rows filled entirely with NA.
df <- df[!is.na(df$INCWAGE) & df$INCWAGE != 99999999, ]

# Summary of the new data frame
stargazer(df, type = "text", title = "Table 2: Summary Statistics of IPUMS Labor/Income Data")
## 
## Table 2: Summary Statistics of IPUMS Labor/Income Data
## =======================================================
## Statistic    N       Mean     St. Dev.   Min     Max   
## -------------------------------------------------------
## YEAR      630,206 2,020.917    1.418    2,019   2,023  
## LABFORCE  630,206   1.615      0.487      1       2    
## INCWAGE   630,206 35,564.870 66,800.890   0   2,099,999
## -------------------------------------------------------
# Eliminate extremely large values of "INCWAGE" using the 1.5 * IQR rule
Q <- quantile(df$INCWAGE, probs = c(.25, .75), na.rm = TRUE)
iqr <- IQR(df$INCWAGE, na.rm = TRUE)  # lowercase name avoids masking the IQR() function

Lower <- Q[1] - 1.5 * iqr
Upper <- Q[2] + 1.5 * iqr

# Keep only observations strictly inside the two fences
df <- subset(df, INCWAGE > Lower & INCWAGE < Upper)


# Graph in boxplot 
boxplot(df$INCWAGE)
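
To make the outlier rule concrete, here is a worked example on a tiny invented vector. It uses the same 1.5 * IQR fences as the cleaning step, and the numbers are not taken from the CPS extract.

```r
# Hypothetical incomes with one extreme value
x <- c(10, 12, 13, 15, 18, 20, 22, 25, 300)

q     <- quantile(x, probs = c(0.25, 0.75))   # first and third quartiles: 13 and 22
iqr   <- q[[2]] - q[[1]]                      # 9
lower <- q[[1]] - 1.5 * iqr                   # -0.5
upper <- q[[2]] + 1.5 * iqr                   # 35.5

# Keep only the values strictly inside the fences; 300 is dropped
kept <- x[x > lower & x < upper]
kept
```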

Clean data summary statistics

# Summary of the cleaned data frame
stargazer(df, type = "text", title = "Table 3: Summary Statistics of IPUMS Labor/Income Data")
## 
## Table 3: Summary Statistics of IPUMS Labor/Income Data
## =====================================================
## Statistic    N       Mean     St. Dev.   Min    Max  
## -----------------------------------------------------
## YEAR      597,533 2,020.907    1.418    2,019  2,023 
## LABFORCE  597,533   1.594      0.491      1      2   
## INCWAGE   597,533 25,213.690 31,118.520   0   124,800
## -----------------------------------------------------
# Take the log of income so that the distribution can be seen in more detail.
# Note: rows with INCWAGE == 0 give log(0) = -Inf.

ggplot(df, aes(x = LABFORCE, y = log(INCWAGE) )) +
  geom_point() +
  labs(title = "Income & Labor Participation Plot (in log)",
       x = "Labor Force Participation",
       y = "Log of Yearly Income In Dollars (pre-tax)") +
  theme_minimal()

Conclusion

  • Given the plot shown above, income does not appear to vary with labor force status. One explanation I considered is self-employment, which I thought might be categorized as not in the labor force. However, after reading how the U.S. Bureau of Labor Statistics categorizes the data, I quickly discarded this explanation, because self-employed individuals are counted as in the labor force. In our data set, such individuals are coded as “2” in the variable “LABFORCE”.

    From a bird’s-eye view, individuals in both groups have the same distribution of income. Therefore, based on the data we have so far, we cannot say that being in the labor force increases one’s income.
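
One way to check this visual impression is a per-group numerical summary, sketched here with base R `tapply()`. The small data frame is a made-up stand-in for the cleaned data, so the numbers say nothing about the actual CPS result.

```r
# Hypothetical stand-in for the cleaned data frame; values are invented
toy <- data.frame(
  LABFORCE = c(1, 1, 1, 2, 2, 2, 2),
  INCWAGE  = c(0, 0, 12000, 18000, 30000, 42000, 55000)
)

# Median and mean wage income within each labor force status
medians <- tapply(toy$INCWAGE, toy$LABFORCE, median)
means   <- tapply(toy$INCWAGE, toy$LABFORCE, mean)

rbind(median = medians, mean = means)
```

Running the same two `tapply()` calls on `df` would give the group medians and means for the real sample.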