| Sampling type | Methods |
|---|---|
| Probability sampling | Simple random sampling, stratified sampling |
| Non-probability sampling | Snowball sampling, quota sampling |

Simple random sampling: A simple random sample is a subset of a population in which each unit has an equal chance of being selected. Common ways to draw such a sample are lotteries and random draws.
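A simple random draw can be sketched in base R with `sample()`. The population of 1,000 unit IDs and the sample size of 50 are made-up values for illustration only.

```r
# Hypothetical population of 1,000 unit IDs (made up for illustration)
population <- 1:1000

set.seed(42)  # fix the seed so the draw is reproducible

# Draw 50 units without replacement; every unit has the same chance of selection
srs <- sample(population, size = 50, replace = FALSE)

length(srs)         # 50
anyDuplicated(srs)  # 0, because sampling is without replacement
```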
Stratified sampling: The population is divided into two or more strata (subgroups) that together make up the population. The researchers then randomly draw a certain percentage of units from each stratum. If the same sampling fraction is used in every stratum, so that each stratum appears in the sample in proportion to its share of the population, the method is called proportional stratified sampling. If the sampling fractions differ across strata, it is called disproportional stratified sampling.
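The difference between the proportional and disproportional variants can be illustrated with a small base R sketch. The population, the stratum names, the sampling fractions, and the helper `draw_stratified()` are all hypothetical and exist only for this example.

```r
set.seed(1)

# Hypothetical population: 800 units in stratum "A", 200 in stratum "B"
pop <- data.frame(id = 1:1000,
                  stratum = rep(c("A", "B"), times = c(800, 200)))

# Hypothetical helper: randomly draw a given fraction of rows from each stratum
draw_stratified <- function(data, fractions) {
  pieces <- lapply(names(fractions), function(s) {
    rows <- which(data$stratum == s)
    n <- round(length(rows) * fractions[[s]])
    data[sample(rows, n), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

# Proportional: the same 10% sampling fraction in every stratum
prop <- draw_stratified(pop, c(A = 0.10, B = 0.10))
table(prop$stratum)  # A: 80, B: 20 (same 80/20 split as the population)

# Disproportional: stratum B is sampled at a higher rate than stratum A
disp <- draw_stratified(pop, c(A = 0.05, B = 0.30))
table(disp$stratum)  # A: 40, B: 60
```

In the disproportional case, stratum B makes up 60% of the sample but only 20% of the population, which is why such samples need weights before population-level estimates are computed.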
Snowball sampling: New units are recruited into the sample through units that are already part of it. This sampling method can be an effective way to conduct research about people with specific traits who might otherwise be difficult to identify.
Quota sampling: Sample selection is based on a predetermined number of units. First, the population is divided into strata (subgroups) that reflect the characteristics the researchers are interested in. Then units are selected until the quota set by the researchers for each subgroup is filled. It is an effective way for researchers to control what or who makes up the sample.
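A quota fill can be sketched as follows. The respondent list, group labels, and quota sizes are assumptions made up for illustration. Units are taken in the order they are encountered, with no random draw, which is what makes this a non-probability method.

```r
# Hypothetical respondents, listed in the order they were encountered
respondents <- data.frame(
  id    = 1:12,
  group = c("young", "old", "young", "young", "old", "young",
            "old", "young", "old", "old", "young", "old")
)

# Quotas set by the researcher (assumed values)
quota <- c(young = 3, old = 2)

# Position of each respondent within their own group (1st young, 2nd young, ...)
rank_in_group <- ave(seq_along(respondents$group), respondents$group,
                     FUN = seq_along)

# Keep respondents until each group's quota is filled
keep <- rank_in_group <= quota[as.character(respondents$group)]
quota_sample <- respondents[keep, ]

quota_sample$id  # 1 2 3 4 5
```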
According to the IPUMS-CPS website, “IPUMS-CPS samples are weighted.” Given this and the layout of the data categories, I think IPUMS-CPS uses a stratified sampling method. This lets the sample designers divide the population into subgroups by the features of interest and then select observations for the sample according to the weight assigned to each subgroup.
As the IPUMS-CPS website states, “The IPUMS-CPS samples are weighted, with some records representing more cases than others. This means that persons and households with some characteristics are over-represented in the samples, while others are underrepresented.” Thus, in certain areas of the data, researchers might not capture the population accurately, because some groups are over-represented and others under-represented in the samples provided by IPUMS-CPS.
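The effect of over- and under-representation can be seen in a toy comparison of an unweighted and a weighted mean, using base R's `weighted.mean()`. All values below are made up. Applying this to the real data would need a weight variable in the extract (for ASEC income variables this is presumably `ASECWT`, which is an assumption here, since the extract in this report keeps only `YEAR`, `LABFORCE`, and `INCWAGE`).

```r
# Hypothetical sample: group "a" is over-represented, so each "a" record
# carries a smaller weight than the single "b" record (all values made up)
toy <- data.frame(
  group  = c("a", "a", "a", "b"),
  income = c(10, 20, 30, 100),
  weight = c(1, 1, 1, 3)  # the "b" record stands for 3 people
)

unweighted <- mean(toy$income)                       # 40: every record counts equally
weighted   <- weighted.mean(toy$income, toy$weight)  # 60: rebalanced toward group "b"
```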
# NOTE: To load data, you must download both the extract's data and the DDI
# and also set the working directory to the folder with these files (or change the path below).
if (!require("ipumsr")) stop("Reading IPUMS data into R requires the ipumsr package. It can be installed using the following command: install.packages('ipumsr')")
## Loading required package: ipumsr
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(dplyr)
library(ggplot2) # needed for the ggplot() call further down
ddi <- read_ipums_ddi("/Users/pin.lyu/Desktop/cps_00001.xml")
data <- read_ipums_micro(ddi)
## Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
# Remove variables not needed
df <- subset(data, select = c(YEAR, LABFORCE, INCWAGE)) |>
  as.data.frame()
# Basic data summary
stargazer(df,type = "text", title = "Table 1: Summary Statistics of IPUMS Labor/Income Data")
##
## Table 1: Summary Statistics of IPUMS Labor/Income Data
## ==================================================================
## Statistic N Mean St. Dev. Min Max
## ------------------------------------------------------------------
## YEAR 4,198,268 2,021.700 1.019 2,019 2,023
## LABFORCE 4,198,268 1.306 0.761 0 2
## INCWAGE 800,468 20,957,460.000 40,665,920.000 0 99,999,999
## ------------------------------------------------------------------
# Remove "0" in variable "LABFORCE"
# 0 = "NIU", which stands for "Not In Universe". This means the question was not asked of the person surveyed, so we should delete these entries.
df <- df[df$LABFORCE != 0, ]
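# Remove 99999999 ("Not In Universe") and missing values in variable "INCWAGE"
# (without the is.na() check, rows with missing INCWAGE would be kept as all-NA rows)
df <- df[!is.na(df$INCWAGE) & df$INCWAGE != 99999999, ]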
# Summary of the new data frame
stargazer(df, type = "text", title = "Table 2: Summary Statistics of IPUMS Labor/Income Data")
##
## Table 2: Summary Statistics of IPUMS Labor/Income Data
## =======================================================
## Statistic N Mean St. Dev. Min Max
## -------------------------------------------------------
## YEAR 630,206 2,020.917 1.418 2,019 2,023
## LABFORCE 630,206 1.615 0.487 1 2
## INCWAGE 630,206 35,564.870 66,800.890 0 2,099,999
## -------------------------------------------------------
# Eliminate extreme values in "INCWAGE" using the 1.5 * IQR rule
Q <- quantile(df$INCWAGE, probs = c(.25, .75), na.rm = TRUE)
iqr <- IQR(df$INCWAGE, na.rm = TRUE)
Lower <- Q[1] - 1.5 * iqr
Upper <- Q[2] + 1.5 * iqr
df <- subset(df, INCWAGE > Lower & INCWAGE < Upper)
# Graph in boxplot
boxplot(df$INCWAGE)
# Summary of the new data frame
stargazer(df, type = "text", title = "Table 3: Summary Statistics of IPUMS Labor/Income Data")
##
## Table 3: Summary Statistics of IPUMS Labor/Income Data
## =====================================================
## Statistic N Mean St. Dev. Min Max
## -----------------------------------------------------
## YEAR 597,533 2,020.907 1.418 2,019 2,023
## LABFORCE 597,533 1.594 0.491 1 2
## INCWAGE 597,533 25,213.690 31,118.520 0 124,800
## -----------------------------------------------------
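The 1.5 × IQR filter applied to `INCWAGE` can be checked by hand on a small vector. The wage values below are made up for illustration and are not taken from the IPUMS data.

```r
# Made-up wage vector with one extreme value
wages <- c(0, 10, 20, 30, 40, 50, 60, 70, 1000)

q   <- quantile(wages, probs = c(.25, .75))  # Q1 = 20, Q3 = 60
iqr <- IQR(wages)                            # 60 - 20 = 40

lower <- q[[1]] - 1.5 * iqr                  # 20 - 60 = -40
upper <- q[[2]] + 1.5 * iqr                  # 60 + 60 = 120

kept <- wages[wages > lower & wages < upper] # drops only the value 1000
```

Because wages cannot be negative, the lower fence does nothing in this example. Only the upper fence removes observations.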
# Take the log of income to show the distribution in more detail
ggplot(df, aes(x = LABFORCE, y = log(INCWAGE))) +
  geom_point() +
  labs(title = "Income & Labor Participation Plot (in log)",
       x = "Labor Force Participation",
       y = "Log of Yearly Income in Dollars (pre-tax)") +
  theme_minimal()
Given the plot shown above, income level does not appear to vary with labor force participation status. One explanation I considered is self-employment, if self-employed individuals were categorized as not in the labor force. However, after reading how the U.S. Bureau of Labor Statistics categorizes the data, I discarded this explanation, because self-employed individuals are labeled as in the labor force. In our data set, these individuals would be coded as “2” in variable “LABFORCE”.
Individuals from both groups have the same distribution of income level from a bird's-eye view. Therefore, based on what we have so far in the data, we cannot say that labor force participation increases one’s income level.
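A visual comparison of two point clouds can hide differences, so one possible follow-up is to compare the groups numerically. The sketch below uses a synthetic stand-in data frame (made-up values, not the IPUMS extract) with the same column names, and base R's `tapply()` and `aggregate()`.

```r
set.seed(7)

# Synthetic stand-in for the cleaned data (values are made up, not IPUMS data)
toy <- data.frame(
  LABFORCE = rep(c(1, 2), each = 500),
  INCWAGE  = c(rexp(500, rate = 1 / 20000), rexp(500, rate = 1 / 30000))
)

# Min, quartiles, median, mean, and max of income within each LABFORCE group
group_summaries <- tapply(toy$INCWAGE, toy$LABFORCE, summary)
group_summaries

# Group medians side by side
medians <- aggregate(INCWAGE ~ LABFORCE, data = toy, FUN = median)
medians
```

Running the same two calls on `df` instead of `toy` would show whether the medians and quartiles of the two groups are in fact close.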