In this report I present an approach to benchmark the Current Population Survey (CPS) with the Quarterly Census of Employment and Wages (QCEW). I rely on the assumptions that the CPS has an accurate occupation share by industry and that the number of workers per occupation per industry in the CPS can be defined as a product of the observed CPS number by the ratio of the number of workers in that industry given by the CPS and the QCEW. I apply this method to the occupations of nurses and doctors and find that the CPS mainly underestimates the number of nurses. This affects the analysis of missing workers, as I do not find a shortage of nurses with the benchmarked CPS.
Let \(l^{c}_{iot}\) be the number of workers in industry \(i\) and occupation \(o\) in month \(t\), according to the Current Population Survey (CPS), and let \(l^q_{it}\) be the number of workers according to the Quarterly Census of Employment and Wages (QCEW), which does not have information at the occupation level. Define \(s_{iot}=l^c_{iot}/l^c_{it}\) as the share of workers of occupation \(o\) in industry \(i\) and period \(t\), based on the CPS. Assume that this share is accurate but the total number of workers per industry is inaccurate, in the form \(l^c_{it}=\xi_{it} \cdot l^q_{it}\).
Assuming that \(\xi_{it}\) is the same across occupations, we can estimate it for each industry and benchmark the CPS with the QWES following the next steps. First, calculate the scaling factor for industry \(i\) as \(\xi_{it} = l^c_{it}/l^q_{it}\), which we can observe in the data. Second, re-scale the number of workers in each industry and occupation as \(\hat{l}^c_{iot}=\xi_{it}\cdot l^c_{iot}\). With this approach, the CPS will be closer to the QCEW, which is treated as the ground truth here, i.e., \(\sum_{i=1}^{N}(\hat{l}^c_{it}-l^q_{it})^2 \le \sum_{i=1}^{N}(l^c_{it}-l^q_{it})^2\). Lastly, sum the number of workers by occupation across industries, i.e., \(\hat{l}^c_{ot}=\sum_{i=1}^N\hat{l}^c_{iot}\), which is what we will use to calculate the number of missing workers per occupation.
An alternative approach to use information from both the CPS and the QCEW is to use the shares of workers by industry from the CPS to calculate a share-weighted number of workers per occupation in the QCEW by industry, i.e. \(\hat{l}^q_{iot}=s_{iot}\cdot l^q_{it}\).
Note on time periods. The QCEW gives information of total employment per industry per month. But the data is only available till March/2023.
I will do the analysis suggested above only for nurses and physicians, for simplicity.
Upload raw CPS data.
# Employed individuals on CPS
cps <- read_dta("../masterfile/2003_2023.dta") %>%
filter(empstat %in% c(10,12)) %>% # "At work" or "Has job, not at work last week"
mutate(date = paste(year,month,"1",sep="-") %>% as.Date("%Y-%m-%d")) # formats date
From 2020 onwards, the PSC has split the hospital industry code (8190) into two more specific codes (8191 and 8192). This complicates the match between CPS and QCEW that we’ll make later. That’s why, from 2020, I’ll be replacing the new industry codes with the old ones. See more details about the change at this link.
cps <- cps %>% mutate(ind = replace(ind, ind %in% c(8191,8192), 8190))
Filter CPS to have only nurses and doctors.
# Filter to nurses and doctors
nurses_doctors <- cps %>%
mutate(nurse = ifelse(occ2010 %in% c(3500,3600,3130),1,0), # create nurse dummy
doctor = ifelse(occ2010==3060,1,0)) %>% # create doctor dummy
filter(nurse==1 | doctor==1) # keep only doctors and nurses
The QCEW data is organized as follows. There is one spreadsheet per year per industry, in each of this, there is employment data per month, state, and county. Within the state, employed workers are divided by employer (Federal Government, Private Sector, Local Government, etc.).
The CPS uses industry classification codes for detailed industry based on the 2017 North American Industry Classification System (NAICS) but more aggregate. The QCEW uses the NAICS of different years to classify industries in different years (more details here). To merge these datasets I manually aggregate the QCEW series using the CPS corresponding codes, available at this online attachment [link].
After manually matching the 129 CPS industry codes to 171 NAICS codes, for nurses and doctors, I can follow the steps below to calculate the scaling factors and benchmark the CPS with the QCEW.
Import the industry codes and see how many files I can match in the QCEW.
# Import industry code dictionary
ind_codes <- read.csv("../input_data/cps_qcew_ind.csv")
Aggregate workers across employers and states.
# Function to process CSV
process_csv <- function(year, industry, quiet=F) {
# Import CSV and edit
year_folder <- paste0("../qcew_data/",year,"_qtrly_by_industry/",
year,ifelse(year<2023,".q1-q4",".q1-q1"),".by_industry/")
# Condition on finding file for given industry
missing <- list.files(year_folder,
pattern = paste0(".* ", industry, " .*")) %>% is_empty
if(missing==F) {
# Process data
ind_df <- read.csv(paste0(year_folder, list.files(year_folder,
pattern = paste0(".* ", industry, " .*")))) %>% # import csv
filter(str_detect(area_title, "Statewide")) %>% # keep only aggregate by state
select(year, qtr, industry_code, month1_emplvl, month2_emplvl, month3_emplvl) %>% # keep variables of interest
group_by(qtr) %>% # group by quarter to collapse summing workers
summarise(year = year[1], ind = industry_code[1], # keep year and industry
month1_empl = sum(month1_emplvl), month2_empl = sum(month2_emplvl),
month3_empl = sum(month3_emplvl)) %>% # sum number of workers
pivot_longer(cols = starts_with("month"), # reshapes data to have it by industry-date
names_to = "month", values_to = "empl",
names_pattern = "month(.)_empl") %>%
mutate(date = paste(year,1:ifelse(year<2023,12,3),"1",sep="-") %>% as.Date("%Y-%m-%d")) %>% # create date variable
select(date, ind, empl) # keep variables of interest
# Save clean processed csv
write.csv(ind_df, paste0("../qcew_data/", year, "_qtrly_by_industry/",
year, "_", industry, ".csv"), row.names = F)
}
else { # print the ones we are not able to match
if(quiet==F) {
paste0("For year ", year," we cannot match NAICS code ",
industry, " to any QCEW file.") %>% print
}
}
}
This part of the code is commented out because it is inefficient to run it every time. The csv files were processed already and the next code chunk uses them directly, skipping this part.
# # Apply function for all industries and years
# for (y in 2003:2023) {
# for (i in 1:length(ind_codes$naics)) {
# process_csv(y, ind_codes$naics[i], quiet=T)
# }
# }
For every year, there are 5 NAICs codes that I cannot match with the QCEW files but they are not healthcare related and have low share of nurses and doctors (less than or equal to 0.003).
Lastly, I append all the separate QCEW year-industry datasets created above.
# First dataset to start appending all others
qcew <- read.csv(paste0("../qcew_data/2003_qtrly_by_industry/",
list.files("../qcew_data/2003_qtrly_by_industry/")[2]))
# Append every industry for every year
for(y in 2003:2023) { # loop starts at third file for 2003 because second is above, first is dir with raw data
for(i in ifelse(y>2003,2,3):length(list.files(paste0("../qcew_data/",y,"_qtrly_by_industry/")))) {
qcew <- qcew %>%
rbind(read.csv(paste0("../qcew_data/",y,"_qtrly_by_industry/",
list.files(paste0("../qcew_data/",y,"_qtrly_by_industry/"))[i])))
}
}
# Sort and rename
qcew <- qcew %>% arrange(ind, date) %>% rename(naics=ind)
Merge QCEW data with industry code dictionary and sum workers across NAICs codes for a given CPS industry category.
# Merge with CPS codes and sum by industry-date
qcew <- left_join(qcew, ind_codes %>% select(cps_ind,naics), by = "naics") %>%
group_by(cps_ind, date) %>% summarise(lq_i = sum(empl)) %>% # sum workers by CPS industry code
rename(ind = cps_ind) %>% mutate(date = as.Date(date))
Given all the procedures above, now I can merge the CPS with the QCEW to compare them and benchmark the former with the latter.
# Merge
cps_qcew <- inner_join(cps_nurse_doctor_industry, # merging keeping only matched observations!
qcew, by = c("ind", "date")) %>%
mutate(xi = lc_i/lq_i) %>% # calculate scaling factor
mutate(xi_w1 = winsorize(xi,.01), xi_w5 = winsorize(xi,.05),
xi_w10 = winsorize(xi,.1)) %>% # winsorized scaling factor
mutate(lc_io_hat = xi*lc_io, lc_io_hat_w1 = xi_w1*lc_io,
lc_io_hat_w5 = xi_w5*lc_io, lc_io_hat_w10 = xi_w10*lc_io,# calculate scaled l^cps_io
lq_io_hat = rho_io*lq_i) # calculate share-weighted QCEW
# Collapse
nurses_doctors_collapsed <- cps_qcew %>% group_by(nurse, date) %>%
summarise(lc_o = sum(lc_io), lc_o_hat = sum(lc_io_hat),
lc_o_hat_w1 = sum(lc_io_hat_w1), lc_o_hat_w5 = sum(lc_io_hat_w5),
lc_o_hat_w10 = sum(lc_io_hat_w10), lq_o_hat = sum(lq_io_hat))
Below we can see the summary statistics for \(\xi_{it}\). There are a lot of zeros and small values, with a median of 1.03 and 75th percentile of 1.46. However, there are also some very high values with a maximum of 2092.94, bringing the average to 7.57. Given these extreme values, I also present summary statistics and histograms for the variable winsorized at different percentiles.
The weighted average uses the number of doctors employed in a given industry in the first month of the sample period (January 2003). The idea is to place more weight on industries that have more doctors, avoid using time-varying weights and avoid having the weights related to the variables that go into measuring \(\xi_{it}\) itself.
Note: I am winsorizing across the whole distribution, not within industry. I am not completely sure this is the right approach.
# Create winsorized versions of xi
cps_qcew <- cps_qcew %>%
mutate(xi_w1 = winsorize(xi,.01),
xi_w5 = winsorize(xi,.05),
xi_w10 = winsorize(xi,.1))
| Variable | Min | Q1 | Median | Mean | Q3 | Max | Weighted mean | SD |
|---|---|---|---|---|---|---|---|---|
| \(\xi_{it}\) | 0 | 0.618 | 1.04 | 7.698 | 1.473 | 2092.94 | 1.087 | 76.826 |
| \(\xi_{it}\) winsorized at 1/99% | 0 | 0.618 | 1.04 | 1.284 | 1.473 | 10.449 | 1.081 | 1.428 |
| \(\xi_{it}\) winsorized at 5/95% | 0.037 | 0.618 | 1.04 | 1.104 | 1.473 | 2.693 | 1.046 | 0.658 |
| \(\xi_{it}\) winsorized at 10/90% | 0.335 | 0.618 | 1.04 | 1.077 | 1.473 | 2.003 | 1.019 | 0.53 |
The histograms below show the distribution for \(\xi_{it}\) without any weighting.
If we look at the average \(\xi_{it}\) overtime, we can see that the higher values are in the early 2000s, but there are some spikes across the whole sample period. The average is notably low for more recent periods. The median, however, is close to one for the whole sample period. If we weight this average by the number of doctors in each industry in January 2003, however, there is still a downward trend but the values are closer to one for the whole sample period.
Below we can see a comparison of the raw CPS with the CPS benchmarked by the QCEW for nurses and doctors, i.e., we can see \(l^c_{ot}\) and \(\hat{l}^c_{ot}\) for these occupations. I also plot the linear time trend that we would use to calculate the number of missing workers. There are no missing nurses using the benchmarked CPS without winsorization, only when we winsorize the scaling factor at the 5/95th and 10/90th percentiles. There are missing doctors with and without winsorization but there is also a very odd data point before 2020, which may be misleading and affecting the slope of the time trend.
# Figures by occupation
df_plot <- nurses_doctors_collapsed %>%
pivot_longer(cols = c("lc_o_hat", "lc_o_hat_w1", "lc_o_hat_w5", # reshapes
"lc_o_hat_w10", "lc_o", "lq_o_hat"),
names_to = "type", values_to = "l_o")
# Calculate linear trend
calc_exp_values <- function(nurse_reg, var) {
# Run linear regression from Jan2013 to Jan2020
regression <-
lm(l_o ~ date,
data = df_plot %>%
filter(date %within% interval(ymd("2013-01-01"),ymd("2020-01-01")) &
nurse==nurse_reg & type==var))
# Calculate expected average for last 6 months
exp_values <- regression$coefficients["(Intercept)"] +
regression$coefficients["date"]*as.numeric(df_plot$date %>% unique)
# Add to dataset
df_plot$l_o_reg[df_plot$nurse==nurse_reg & df_plot$type==var] <<- exp_values
}
for(i in 0:1) {
for(v in c("lc_o",paste0("lc_o_hat",c("","_w1","_w5","_w10")),"lq_o_hat")) {
calc_exp_values(i,v)
}
}
Here, I plot the CPS, the QCEW, the benchmarked CPS, and the CES for the main industry codes of interest, i.e., the ones with higher share of nurses and doctors. I calculate the benchmarked CPS at the industry level as \(\hat{l}^c_{it}=\frac{1}{\xi_{it}} l^q_{it}\). This also work as a sanity check to confirm that \(\hat{l}^c_{it}=\hat{l}^q_{it}\), which should be true by design. The QCEW seems to be smoother than the CPS, as expected. I did not add the CES to all industries, but I can do it later if necessary.
For some industries, the values are notably different, namely, offices of physicians, outpatient care centers, and other health care services.
# Read CES data
ces <- read_csv("../../CES/data/ces_selected_industries.csv")
# Merge with CPS and QCEW
cps_qcew <- cps_qcew %>%
left_join(ces, by = c("ind","date"))
# Figures by occupation
df_plot <- cps_qcew %>% mutate(lc_i_hat = lc_i/xi) %>% filter(nurse==1) %>% # by industry, so keep only one occ
select(ind, date, lc_i, lc_i_hat, lq_i, le_i) %>%
pivot_longer(cols = c("lc_i", "lc_i_hat", "lq_i", "le_i"), # reshapes
names_to = "type", values_to = "l_i")
# Calculate linear trend
calc_exp_values <- function(industry, var) {
# Run linear regression from Jan2013 to Jan2020
df_reg <- df_plot %>% # filter for interval, var, and industry of interest
filter(date %within% interval(ymd("2013-01-01"),ymd("2020-01-01")) &
type==var & ind==industry)
df_plot$l_i_reg <- NA
if(all(is.na(df_reg$l_i))==F) {
regression <- lm(l_i ~ date, data = df_reg)
# Calculate expected average for last 6 months
exp_values <- regression$coefficients["(Intercept)"] +
regression$coefficients["date"]*as.numeric(df_plot$date %>% unique)
# Add to dataset
df_plot$l_i_reg[df_plot$type==var & df_plot$ind==industry] <<- exp_values
}
}
for(i in main_industries$ind) {
for(v in df_plot$type %>% unique) {
calc_exp_values(i,v)
}
}
Here, I plot missing workers by occupation using only the CPS and extrapolating the pre-pandemic trends with regressions starting on 2013, 2015, and 2017.
df_plot <- nurses_doctors_collapsed %>% select(nurse, date, lc_o)
# Calculate linear trend
calc_exp_values <- function(nurse_reg, start_year) {
# Run linear regression from Jan2013 to Jan2020
regression <-
lm(lc_o ~ date,
data = df_plot %>%
filter(date %within%
interval(ymd(paste0(start_year,"-01-01")),ymd("2020-01-01")) &
nurse==nurse_reg))
# Calculate expected average for last 6 months
exp_values <- regression$coefficients["(Intercept)"] +
regression$coefficients["date"]*as.numeric(df_plot$date %>% unique)
# Add to dataset
assign(paste0("df_plot$lc_o_reg_",start_year-2000,"[df_plot$nurse==nurse_reg]"),
exp_values)
if(start_year==2013) {
df_plot$lc_o_reg_13[df_plot$nurse==nurse_reg] <<- exp_values
}
if(start_year==2015) {
df_plot$lc_o_reg_15[df_plot$nurse==nurse_reg] <<- exp_values
}
if(start_year==2017) {
df_plot$lc_o_reg_17[df_plot$nurse==nurse_reg] <<- exp_values
}
}
for(i in 0:1) {
for(y in c(2013,2015,2017)) {
calc_exp_values(i,y)
}
}
Figures with 3- and 6-month rolling averages of the CPS are presented below. The first two figures are for the 3-month rolling average.
for(k in c(3,6)) {
# Calculate rolling average
df_plot <- nurses_doctors_collapsed %>% select(nurse, date, lc_o) %>%
mutate(lc_o = rollmean(lc_o, k, fill=NA, align='right'))
# Calculate linear extrapolations using rolling average
for(i in 0:1) {
for(y in c(2013,2015,2017)) {
calc_exp_values(i,y)
}
}
# Plots
plot_missing_workers(0) %>% plot
plot_missing_workers(1) %>% plot
}
CPS industry codes and NAICS: https://www2.census.gov/programs-surveys/cps/methodology/Industry%20Codes.pdf and https://www.bls.gov/cps/cenind2012.pdf
CPS occupation codes: https://cps.ipums.org/cps/codes/occ_20112019_codes.shtml
CPS variables: https://cps.ipums.org/cps-action/variables/group/core_work
QCEW variables: https://www.bls.gov/cew/about-data/downloadable-file-layouts/high-level/home.htm
QCEW data files: https://www.bls.gov/cew/downloadable-data-files.htm