Q1

The population of interest is the civilian noninstitutionalized population aged 16 and older. The CPS excludes members of the military and people who are in prison or confined to another institution. This population is split into working and nonworking groups. For example, students and housewives are counted as not working, while people with full-time or part-time jobs are counted as working.

Q2

The sample is 62,000 households.

Q3

Yes, I do think the CPS is a representative sample of the entire US population. It takes many different aspects into account. One example is the set of categories it uses for the “nonworking” population: absent from job, on layoff awaiting recall, actively looking, and none of the above. This lets the data be split into many different categories.

library(ipumsr)

ddi <- read_ipums_ddi("cps_00001.xml")
data <- read_ipums_micro(ddi)
## Use of data from IPUMS CPS is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.
summary(data)
##       YEAR          SERIAL          MONTH           HWTFINL      
##  Min.   :2019   Min.   :    1   Min.   : 1.000   Min.   :    0   
##  1st Qu.:2021   1st Qu.:17959   1st Qu.: 3.000   1st Qu.: 1551   
##  Median :2021   Median :35793   Median : 5.000   Median : 3368   
##  Mean   :2021   Mean   :36587   Mean   : 5.582   Mean   : 3071   
##  3rd Qu.:2022   3rd Qu.:54426   3rd Qu.: 9.000   3rd Qu.: 4269   
##  Max.   :2023   Max.   :94633   Max.   :12.000   Max.   :17273   
##                                                  NA's   :654335  
##      CPSID              ASECFLAG          ASECWTH            PERNUM      
##  Min.   :0.000e+00   Min.   :1.0       Min.   : 109.7    Min.   : 1.000  
##  1st Qu.:2.020e+13   1st Qu.:1.0       1st Qu.: 970.7    1st Qu.: 1.000  
##  Median :2.021e+13   Median :1.0       Median :1888.0    Median : 2.000  
##  Mean   :1.882e+13   Mean   :1.2       Mean   :1937.2    Mean   : 2.128  
##  3rd Qu.:2.021e+13   3rd Qu.:1.0       3rd Qu.:2563.5    3rd Qu.: 3.000  
##  Max.   :2.023e+13   Max.   :2.0       Max.   :9975.4    Max.   :16.000  
##                      NA's   :2394449   NA's   :2602318                   
##      WTFINL           CPSIDP              ASECWT           LABFORCE    
##  Min.   :    0    Min.   :0.000e+00   Min.   :   86.5   Min.   :0.000  
##  1st Qu.: 1554    1st Qu.:2.020e+13   1st Qu.:  990.2   1st Qu.:1.000  
##  Median : 3357    Median :2.021e+13   Median : 1911.4   Median :1.000  
##  Mean   : 3140    Mean   :1.882e+13   Mean   : 2000.1   Mean   :1.303  
##  3rd Qu.: 4349    3rd Qu.:2.021e+13   3rd Qu.: 2681.4   3rd Qu.:2.000  
##  Max.   :31157    Max.   :2.023e+13   Max.   :17421.8   Max.   :2.000  
##  NA's   :654335                       NA's   :2602318                  
##     INCWAGE        
##  Min.   :       0  
##  1st Qu.:       0  
##  Median :   32000  
##  Mean   :21125081  
##  3rd Qu.:  124000  
##  Max.   :99999999  
##  NA's   :2602318

Plot before transforming

library("ggplot2")

ggplot(data=data,
       mapping = aes(x=LABFORCE, y= INCWAGE)) + geom_point()
## Don't know how to automatically pick scale for object of type
## <haven_labelled/vctrs_vctr/integer>. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type
## <haven_labelled/vctrs_vctr/double>. Defaulting to continuous.
## Warning: Removed 2602318 rows containing missing values (`geom_point()`).

Apply Transformation

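# Keep only YEAR, MONTH, LABFORCE and INCWAGE (columns 1, 3, 12 and 13)
data <- data[c("YEAR", "MONTH", "LABFORCE", "INCWAGE")]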

library("psych")
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(data$INCWAGE)
##    vars      n     mean       sd median  trimmed     mad min   max range skew
## X1    1 654335 21125081 40786003  32000 21125081 47443.2   0 1e+08 1e+08 1.42
##    kurtosis       se
## X1     0.01 50420.96
# Recode the INCWAGE special codes to NA so they do not distort the statistics
data$INCWAGE[data$INCWAGE == 99999999] <- NA

data$INCWAGE[data$INCWAGE == 99999998] <- NA

describe(data$INCWAGE)
##    vars      n     mean       sd median  trimmed   mad min     max   range skew
## X1    1 516286 34825.21 65186.22  15000 34825.21 22239   0 2099999 2099999  7.8
##    kurtosis    se
## X1   107.24 90.72
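
The effect of this recoding can be illustrated on a small example. The sketch below uses a made-up wage vector (not CPS data) and the hypothetical names `incwage`, `special_codes`, and `incwage_clean`; it assumes the same two special codes as above and shows how strongly they inflate the mean before they are set to NA.

```r
# Toy wage vector (made-up values, not CPS data).
# 99999999 and 99999998 stand in for the special codes recoded above.
incwage <- c(0, 32000, 54000, 99999999, 99999998, 15000, NA)

special_codes <- c(99999998, 99999999)

# %in% returns FALSE (not NA) for missing inputs, so existing NAs are
# simply left as they are.
incwage_clean <- replace(incwage, incwage %in% special_codes, NA)

mean_raw   <- mean(incwage, na.rm = TRUE)        # inflated by the special codes
mean_clean <- mean(incwage_clean, na.rm = TRUE)  # mean of the real wages only
n_valid    <- sum(!is.na(incwage_clean))         # number of usable observations

print(c(raw = mean_raw, clean = mean_clean, n_valid = n_valid))
```

Using `%in%` with a vector of codes handles both values in one step, which may be easier to extend if more special codes turn up.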

Plot

ggplot(data=data,
       mapping = aes(x=LABFORCE, y= INCWAGE)) +
  geom_point() +
  labs(x = "Labor Force Status", y= "Wage and Salary Annual Income") +
  scale_x_discrete()
## Don't know how to automatically pick scale for object of type
## <haven_labelled/vctrs_vctr/double>. Defaulting to continuous.
## Warning: Removed 2740367 rows containing missing values (`geom_point()`).
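
One possible way to avoid the scale message and get a labelled x-axis is to turn the status codes into a factor before plotting. The sketch below is only an illustration on made-up data, not the plot above: the data frame `toy`, the column `status`, and the short category labels are assumptions chosen for the example. With the real extract, `as_factor()` (re-exported by ipumsr) might serve the same purpose.

```r
library(ggplot2)

# Made-up example data (not CPS data): three status codes and some wages.
toy <- data.frame(
  LABFORCE = c(0, 0, 1, 1, 1, 2, 2, 2, 2),
  INCWAGE  = c(0, 0, 0, 4000, 12000, 28000, 41000, 65000, 150000)
)

# Convert the numeric codes to a factor so ggplot2 treats them as categories.
# The label text is shortened here for readability.
toy$status <- factor(
  toy$LABFORCE,
  levels = c(0, 1, 2),
  labels = c("NIU", "Not in labor force", "In labor force")
)

# A boxplot per category shows the spread and outliers more clearly than
# overplotted points.
p <- ggplot(toy, aes(x = status, y = INCWAGE)) +
  geom_boxplot() +
  labs(x = "Labor Force Status", y = "Wage and Salary Annual Income")

print(p)
```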

This graph of wage and salary income versus labor force status makes sense. For labor force status, 0 corresponds to people who are not in universe (NIU). This includes people in the military, those who are imprisoned, and children aged 14 and under. For that reason, it makes sense that their annual wage and salary income is much lower than that of people coded 1 (not in the labor force) or 2 (in the labor force). However, there are a few outliers in the 0 category. I wonder whether the CPS sometimes misclassifies people who were previously in the military but no longer are. 2 corresponds to people in the labor force, such as those who work full time or part time, so it makes sense that they would have the greatest annual wage and salary income.
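
The comparison between the three groups could also be checked numerically rather than read off the scatter plot. The following is a minimal sketch on made-up values (not CPS data); the names `toy`, `status`, `med_by_status`, and `n_by_status` are hypothetical, and the coding 0 = NIU, 1 = not in the labor force, 2 = in the labor force follows the interpretation above.

```r
# Made-up example data (not CPS data), with NA for wages that are missing.
toy <- data.frame(
  LABFORCE = c(0, 0, 1, 1, 1, 2, 2, 2),
  INCWAGE  = c(0, NA, 0, 5000, NA, 30000, 45000, 60000)
)

# Label the status codes so the summary is easier to read.
toy$status <- factor(
  toy$LABFORCE,
  levels = c(0, 1, 2),
  labels = c("NIU", "Not in labor force", "In labor force")
)

# Median wage per group, ignoring missing values. The median is less
# sensitive to a few very high incomes than the mean.
med_by_status <- tapply(toy$INCWAGE, toy$status, median, na.rm = TRUE)

# Number of non-missing wage observations per group.
n_by_status <- tapply(!is.na(toy$INCWAGE), toy$status, sum)

print(med_by_status)
print(n_by_status)
```

Applied to the real extract, a table like this would show whether the outliers in the 0 category are a handful of cases or a larger group.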