I

Probability sampling is the process of selecting a sample from a population so that every member has a known, non-zero chance of being included. Because selection is random, the sample is more likely to be representative of the population, and valid statistical inferences can be drawn from it.

Non-probability sampling techniques do not use random selection, so the probability that a specific member of the population is included in the sample is unknown. These techniques are frequently used when probability sampling is difficult or impractical.

Simple random sampling is a viable option if we have an exhaustive list of the whole population and we want each person to have an equal probability of being included.
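As a minimal sketch of simple random sampling in R, assuming a hypothetical sampling frame of 1,000 person IDs (the frame, the seed, and the sample size of 50 are illustrative and not taken from any real survey):

```r
# Hypothetical sampling frame: an exhaustive list of 1,000 person IDs
frame <- data.frame(id = 1:1000)

set.seed(42)  # fixed seed so the draw is reproducible

# Draw 50 IDs without replacement; every ID has the same chance of selection
srs_ids <- sample(frame$id, size = 50, replace = FALSE)
srs <- frame[frame$id %in% srs_ids, , drop = FALSE]

# Under simple random sampling, each person's inclusion probability is n / N
inclusion_prob <- nrow(srs) / nrow(frame)
inclusion_prob
```

Here the inclusion probability is 50 / 1000 = 0.05 for every person, which is what "equal probability of being included" means in practice.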

When the population is heterogeneous and we wish to guarantee representation from several subgroups, we should use stratified sampling. It yields more precise estimates for each subgroup.
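Stratified sampling can be sketched in base R as follows. The population, the `region` strata, and the 10 percent sampling fraction are all hypothetical choices made for illustration:

```r
# Hypothetical population with three strata of unequal size
set.seed(1)
population <- data.frame(
  id     = 1:1000,
  region = rep(c("North", "South", "West"), times = c(600, 300, 100))
)

# Proportional allocation: draw 10% of the rows within each stratum
strata <- split(population, population$region)
sampled <- lapply(strata, function(s) {
  s[sample(nrow(s), size = round(0.10 * nrow(s))), , drop = FALSE]
})
stratified_sample <- do.call(rbind, sampled)

# Every region appears in the sample, in proportion to its size
table(stratified_sample$region)
```

Because the draw happens inside each stratum, even the smallest region is guaranteed to appear, which a simple random sample of the same size would not guarantee.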

The first survey is superior because it uses a cross-sectional design, which gathers information from a sample of people at a single point in time. This design is frequently used to understand the characteristics of a population at a particular time in many fields, including public opinion polling. In the second design, information is gathered from the same people at several points in time. While that is helpful for examining changes and trends within a group, its participants are chosen based on judgment and level of experience, which makes it better suited to qualitative research.

https://www.scribbr.com/methodology/sampling-methods/#:~:text=Probability%20sampling%20involves%20random%20selection,you%20to%20easily%20collect%20data

https://www.statisticssolutions.com/what-is-the-difference-between-probability-and-non-probability-sampling/

II

Every month, the U.S. Bureau of the Census conducts the Current Population Survey (CPS), which is completed by more than 65,000 households.

They neglected to include a number of variables that people would take into account before choosing to enter the labor market, such as the type of work, hours worked, location, and compensation, among others.

Among the major worker groups, the unemployment rates were 3.7 percent for adult men, 3.3 percent for adult women, 13.2 percent for teenagers, 3.5 percent for Whites, 5.8 percent for Blacks, 3.1 percent for Asians, and 4.8 percent for Hispanics.

The employment-population ratio in 2023 is 60.2 percent.

The threshold for a statistically significant change in the household survey is about 600,000.

One thing that interests me is that Whites and Asians have lower unemployment rates than Blacks and Hispanics.

I think so. The Current Population Survey (CPS) is conducted monthly and uses a rotating panel design to estimate population parameters and headline statistics such as the unemployment rate and the employment-to-population ratio.
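To illustrate how survey weights turn sample records into a population estimate, here is a minimal sketch of a weighted labor force participation rate. The data frame `toy_cps` and its values are made up. The column names `LABFORCE` and `WTFINL` mirror the IPUMS CPS extract used later, and the coding (0 = not in universe, 1 = not in labor force, 2 = in labor force) is an assumption that should be checked against the IPUMS codebook:

```r
# Toy person-level records; all values are made up for illustration
toy_cps <- data.frame(
  LABFORCE = c(2, 2, 1, 2, 0, 1),  # assumed coding: 0 = NIU, 1 = not in LF, 2 = in LF
  WTFINL   = c(1500, 3000, 2500, 2000, 1800, 1000)
)

# Drop not-in-universe records before computing the rate
in_universe <- toy_cps[toy_cps$LABFORCE %in% c(1, 2), ]

# Weighted share of in-universe persons who are in the labor force
lfpr <- sum(in_universe$WTFINL[in_universe$LABFORCE == 2]) / sum(in_universe$WTFINL)
round(100 * lfpr, 1)
```

Each weight stands for the number of people a respondent represents, so the ratio of weighted sums estimates the population rate rather than the share of sample rows.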

4.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ipumsr)

5.

cps_ddi <- read_ipums_ddi("/Users/timyang/Downloads/cps_00001.xml")
cps_data <- read_ipums_micro(cps_ddi, data_file = "/Users/timyang/Downloads/cps_00001.dat", verbose = FALSE)
cps_data
## # A tibble: 4,198,268 × 14
##     YEAR SERIAL MONTH     HWTFINL   CPSID ASECFLAG ASECWTH PERNUM WTFINL  CPSIDP
##    <dbl>  <dbl> <int+lbl>   <dbl>   <dbl> <int+lb>   <dbl>  <dbl>  <dbl>   <dbl>
##  1  2019      4 3 [March]      NA 2.02e13 1 [ASEC]   2032.      1     NA 2.02e13
##  2  2019      6 3 [March]      NA 2.02e13 1 [ASEC]   1232.      1     NA 2.02e13
##  3  2019      7 3 [March]      NA 2.02e13 1 [ASEC]   1209.      1     NA 2.02e13
##  4  2019      8 3 [March]      NA 2.02e13 1 [ASEC]   1146.      1     NA 2.02e13
##  5  2019      8 3 [March]      NA 2.02e13 1 [ASEC]   1146.      2     NA 2.02e13
##  6  2019     13 3 [March]      NA 2.02e13 1 [ASEC]   1588.      1     NA 2.02e13
##  7  2019     15 3 [March]      NA 2.02e13 1 [ASEC]   1583.      1     NA 2.02e13
##  8  2019     18 3 [March]      NA 2.02e13 1 [ASEC]    981.      1     NA 2.02e13
##  9  2019     18 3 [March]      NA 2.02e13 1 [ASEC]    981.      2     NA 2.02e13
## 10  2019     20 3 [March]      NA 2.02e13 1 [ASEC]   1539.      1     NA 2.02e13
## # ℹ 4,198,258 more rows
## # ℹ 4 more variables: CPSIDV <dbl>, ASECWT <dbl>, LABFORCE <int+lbl>,
## #   INCWAGE <dbl+lbl>
cps_ddi
## An IPUMS DDI for IPUMS CPS with 14 variables
## Extract 'cps_00001.dat' created on 2023-11-07
## User notes:
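# Tabulate the categorical LABFORCE variable using its value labels
# (INCWAGE is continuous, so it is not tabulated here)
labforce_counts <- table(as_factor(cps_data$LABFORCE))
barplot(labforce_counts, main = "Labor Force Status Distribution")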

summary(cps_data)
##       YEAR          SERIAL          MONTH           HWTFINL      
##  Min.   :2019   Min.   :    1   Min.   : 1.000   Min.   :    0   
##  1st Qu.:2021   1st Qu.:18022   1st Qu.: 3.000   1st Qu.: 1567   
##  Median :2022   Median :35856   Median : 5.000   Median : 3392   
##  Mean   :2022   Mean   :36550   Mean   : 5.481   Mean   : 3107   
##  3rd Qu.:2022   3rd Qu.:54342   3rd Qu.: 8.000   3rd Qu.: 4315   
##  Max.   :2023   Max.   :94633   Max.   :12.000   Max.   :18077   
##                                                  NA's   :800468  
##      CPSID              ASECFLAG          ASECWTH            PERNUM      
##  Min.   :0.000e+00   Min.   :1.0       Min.   :  110     Min.   : 1.000  
##  1st Qu.:2.020e+13   1st Qu.:1.0       1st Qu.:  998     1st Qu.: 1.000  
##  Median :2.021e+13   Median :1.0       Median : 1923     Median : 2.000  
##  Mean   :1.890e+13   Mean   :1.3       Mean   : 1982     Mean   : 2.125  
##  3rd Qu.:2.022e+13   3rd Qu.:2.0       3rd Qu.: 2627     3rd Qu.: 3.000  
##  Max.   :2.023e+13   Max.   :2.0       Max.   :10925     Max.   :16.000  
##                      NA's   :3093585   NA's   :3397800                   
##      WTFINL           CPSIDP              CPSIDV              ASECWT       
##  Min.   :    0    Min.   :0.000e+00   Min.   :0.000e+00   Min.   :   86    
##  1st Qu.: 1572    1st Qu.:2.020e+13   1st Qu.:2.020e+14   1st Qu.: 1018    
##  Median : 3388    Median :2.021e+13   Median :2.021e+14   Median : 1951    
##  Mean   : 3182    Mean   :1.890e+13   Mean   :1.890e+14   Mean   : 2048    
##  3rd Qu.: 4402    3rd Qu.:2.022e+13   3rd Qu.:2.022e+14   3rd Qu.: 2749    
##  Max.   :44748    Max.   :2.023e+13   Max.   :2.023e+14   Max.   :17422    
##  NA's   :800468                                           NA's   :3397800  
##     LABFORCE        INCWAGE        
##  Min.   :0.000   Min.   :       0  
##  1st Qu.:1.000   1st Qu.:       0  
##  Median :1.000   Median :   33280  
##  Mean   :1.306   Mean   :20957460  
##  3rd Qu.:2.000   3rd Qu.:  125000  
##  Max.   :2.000   Max.   :99999999  
##                  NA's   :3397800
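The summary above reports an INCWAGE maximum of 99999999 and a mean of about 21 million, which suggests that a special code is being averaged in as if it were a wage. A minimal sketch of one way to handle this, on a toy vector and assuming 99999999 marks not-in-universe records (an assumption to verify against the IPUMS CPS codebook):

```r
# Toy wage vector; 99999999 is assumed to be a not-in-universe code, not a real wage
incwage <- c(0, 33280, 125000, 99999999, 52000, 99999999)

# Recode the sentinel to NA so it does not distort summary statistics
incwage_clean <- ifelse(incwage == 99999999, NA, incwage)

mean(incwage)                      # inflated by the sentinel values
mean(incwage_clean, na.rm = TRUE)  # mean of the valid wages only
```

Applying the same recoding to `cps_data$INCWAGE` before summarizing or plotting would keep the sentinel from dominating the scale.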

6.

# Line plot of wage income against labor force status,
# using the "LABFORCE" and "INCWAGE" columns of cps_data
plot(cps_data$LABFORCE, cps_data$INCWAGE, type = "l", col = "blue",
     xlab = "Labor force status (LABFORCE)", ylab = "Wage and salary income (INCWAGE)",
     main = "Wage Income by Labor Force Status")

# Scatter plot of labor force status against wage income,
# using the same two columns with the axes swapped
plot(cps_data$INCWAGE, cps_data$LABFORCE, pch = 16, col = "blue",
     xlab = "Wage and salary income (INCWAGE)", ylab = "Labor force status (LABFORCE)",
     main = "Labor Force Status by Wage Income")

One evident trend is the rise in the labor force as a result of population growth. In addition, the percentage of people in the labor force and the area’s income level are negatively correlated.