1 Summary

In this report I present an approach to benchmark the Current Population Survey (CPS) with the Quarterly Census of Employment and Wages (QCEW). I rely on the assumptions that the CPS has an accurate occupation share by industry and that the number of workers per occupation per industry in the CPS can be defined as a product of the observed CPS number by the ratio of the number of workers in that industry given by the CPS and the QCEW. I apply this method to the occupations of nurses and doctors and find that the CPS mainly underestimates the number of nurses. This affects the analysis of missing workers, as I do not find a shortage of nurses with the benchmarked CPS.

2 Conceptual Framework

Let \(l^{c}_{iot}\) be the number of workers in industry \(i\) and occupation \(o\) in month \(t\), according to the Current Population Survey (CPS), and let \(l^q_{it}\) be the number of workers according to the Quarterly Census of Employment and Wages (QCEW), which does not have information at the occupation level. Define \(s_{iot}=l^c_{iot}/l^c_{it}\) as the share of workers of occupation \(o\) in industry \(i\) and period \(t\), based on the CPS. Assume that this share is accurate but the total number of workers per industry is inaccurate, in the form \(l^c_{it}=\xi_{it} \cdot l^q_{it}\).

Assuming that \(\xi_{it}\) is the same across occupations, we can estimate it for each industry and benchmark the CPS with the QWES following the next steps. First, calculate the scaling factor for industry \(i\) as \(\xi_{it} = l^c_{it}/l^q_{it}\), which we can observe in the data. Second, re-scale the number of workers in each industry and occupation as \(\hat{l}^c_{iot}=\xi_{it}\cdot l^c_{iot}\). With this approach, the CPS will be closer to the QCEW, which is treated as the ground truth here, i.e., \(\sum_{i=1}^{N}(\hat{l}^c_{it}-l^q_{it})^2 \le \sum_{i=1}^{N}(l^c_{it}-l^q_{it})^2\). Lastly, sum the number of workers by occupation across industries, i.e., \(\hat{l}^c_{ot}=\sum_{i=1}^N\hat{l}^c_{iot}\), which is what we will use to calculate the number of missing workers per occupation.

An alternative approach to use information from both the CPS and the QCEW is to use the shares of workers by industry from the CPS to calculate a share-weighted number of workers per occupation in the QCEW by industry, i.e. \(\hat{l}^q_{iot}=s_{iot}\cdot l^q_{it}\).

Note on time periods. The QCEW gives information of total employment per industry per month. But the data is only available till March/2023.

I will do the analysis suggested above only for nurses and physicians, for simplicity.

3 CPS

Upload raw CPS data.

# Employed individuals on CPS
cps <- read_dta("../masterfile/2003_2023.dta") %>% 
  filter(empstat %in% c(10,12)) %>% # "At work" or "Has job, not at work last week"
  mutate(date = paste(year,month,"1",sep="-") %>% as.Date("%Y-%m-%d")) # formats date

From 2020 onwards, the PSC has split the hospital industry code (8190) into two more specific codes (8191 and 8192). This complicates the match between CPS and QCEW that we’ll make later. That’s why, from 2020, I’ll be replacing the new industry codes with the old ones. See more details about the change at this link.

cps <- cps %>% mutate(ind = replace(ind, ind %in% c(8191,8192), 8190))

Filter CPS to have only nurses and doctors.

# Filter to nurses and doctors
nurses_doctors <- cps %>% 
  mutate(nurse = ifelse(occ2010 %in% c(3500,3600,3130),1,0), # create nurse dummy
         doctor = ifelse(occ2010==3060,1,0)) %>% # create doctor dummy
  filter(nurse==1 | doctor==1) # keep only doctors and nurses

3.1 Shares per industry and occupation

For each industry, I calculate the total number of workers, nurses, and doctors for that industry, i.e., \(l^c_{it}\), \(l^c_{iot}\).

I also calculate two shares of nurses and doctors by industry. For a given occupation, the first share is given by the number workers in that occupation in that industry over the total number of workers in that industry, \(\rho_{iot} = l^c_{iot}/\sum_{o=1}^Ol^c_{iot}\). The second share is given by the number of workers in that occupation in that industry over the total number of workers in that occupation across industries, i.e., \(\sigma_{iot}=l^c_{iot}/\sum_{i=1}^I{l^c_{iot}}\). Most of the times in this report, when I refer to a share I am talking about the share of an occupation in a given industry, \(\rho_{iot}\).

For example, \(\rho_{iot}\) for hospitals and doctors is given by the number of doctors in hospitals over the total number of workers in hospitals, and \(\sigma_{oit}\) is given by the number of doctors in hospitals over the total number of doctors across industries.

# Workers per industry
l_cps_i <- cps %>% group_by(ind,date) %>% # group by industry and date
  summarise(lc_i = sum(wtfinl, na.rm = T)) %>% arrange(ind,date) %>% # calculate total number of workers
  slice(rep(1:n(), each = 2)) %>% # duplicate every obs to facilitate merge with dataset by doctor/nurse status
  mutate(nurse = rep(c(1,0),n()/2)) # Add doctor/nurse dummy for each industry to facilitate merge

# Workers per occupation-industry
l_cps_io <- nurses_doctors %>% group_by(ind,date,nurse) %>% # group by occupation-industry
  summarise(lc_io = sum(wtfinl, na.rm = T)) %>% arrange(ind,date,nurse) # calculate total number of workers

# Empty dataset for merge
empty <- data.frame(ind = rep(nurses_doctors$ind %>% unique, # codes for industries that have doctors or nurses
                              cps$date %>% unique %>% length)) %>% # repeat industry codes for every date
  arrange(ind) %>%
  mutate(date = rep(cps$date %>% unique %>% sort, # dates
                    nurses_doctors$ind %>% unique %>% length)) %>% # repeat dates across industries 
  slice(rep(1:n(), each = 2)) %>% # duplicate every obs to facilitate merge with dataset by doctor/nurse status
  mutate(nurse = rep(c(1,0),n()/2)) # Add doctor/nurse dummy for each industry to facilitate merge
  
# Merge and calculate share of occupation per industry
cps_nurse_doctor_industry <- empty %>%
  full_join(l_cps_i, l_cps_io, by=c("ind","date","nurse")) %>% 
  full_join(l_cps_io, by=c("ind","date","nurse")) %>%
  arrange(ind, date, nurse) %>%
  mutate(rho_io = lc_io/lc_i) %>% # divide worker by occupation-industry by total per industry
  replace(is.na(.), 0) # replace NAs with zeros 
rm(empty)

# Calculate share of industry per occupation
cps_nurse_doctor_industry <- cps_nurse_doctor_industry %>% 
  group_by(date,nurse) %>% # aggregate by occupation-date
  mutate(lc_o = sum(lc_io)) %>% # sum across industries
  ungroup() %>% mutate(sigma_io = lc_io/lc_o) # calculate share

There are a lot of zeros and low shares, this happen because the industry is considered if at least one worker in one period is employed in that industry according to the CPS. Nevertheless, the top 5 shares for each occupation are what we expected, suggesting we are on the right track.

To merge the CPS with the QCEW, I will only consider industries where the average share across periods for a given occupation is greater than or equal to 0.001, i.e., \(s_{io}\ge0.001\). There are 129 CPS industry codes for nurses and doctors. These codes are more aggregate than the NAICS used at the QCEW, so I have to use the CPS industry code dictionary to identify which NAICS codes I will consider in the QCEW, this will have to be done manually since the industry code correspondence is published as a PDF. There are 171 different NAICs codes associated with the 129 CPS industry codes.

3.1.1 Top industries by share

# Top 10 industries per occupation by share

# Doctors
doctors <- cps_nurse_doctor_industry %>% filter(nurse==0) %>% group_by(ind) %>% 
  summarise(rho_io = mean(rho_io) %>% round(3),
            lc_io = mean(lc_io) %>% round(0),
            sigma_io = mean(sigma_io) %>% round(3)) %>% arrange(desc(rho_io)) %>% head(10)

doctors <- doctors %>%
  mutate(label = c("Offices of physicians", "Hospitals", "Outpatient care centers", 
                   "Offices of optometrists", "Other health care services",
                   "Offices of other health practitioners", "Colleges and universities, including junior colleges",
                   "Administration of human resource programs", "Offices of chiropractors",
                   "Scientific research and development services")) %>%
  select(ind, label, everything())

kable(doctors, caption = "Doctors: top 10 industries by share",
      col.names = c("Industry code", "Industry label","$\\rho$", 
                    "Workers", "$\\sigma$"), escape=F) %>%
  kable_styling("striped", full_width = F) %>%
  scroll_box(width = "100%", height = "250px")
Doctors: top 10 industries by share
Industry code Industry label \(\rho\) Workers \(\sigma\)
7970 Offices of physicians 0.226 365990 0.390
8190 Hospitals 0.061 404712 0.420
8090 Outpatient care centers 0.040 56675 0.059
8070 Offices of optometrists 0.030 3655 0.004
8180 Other health care services 0.025 39449 0.042
8080 Offices of other health practitioners 0.013 3869 0.004
7870 Colleges and universities, including junior colleges 0.009 32158 0.034
9480 Administration of human resource programs 0.009 9668 0.010
7990 Offices of chiropractors 0.006 930 0.001
7460 Scientific research and development services 0.005 2661 0.003
# Nurses
nurses <- cps_nurse_doctor_industry %>% filter(nurse==1) %>% group_by(ind) %>% 
  summarise(rho_io = mean(rho_io) %>% round(3),
            lc_io = mean(lc_io) %>% round(0),
            sigma_io = mean(sigma_io) %>% round(3)) %>% arrange(desc(rho_io)) %>% head(10)

nurses <- nurses %>% 
  mutate(label = c("Home health care services", "Nursing care facilities",
              "Hospitals", "Other health care services", "Outpatient care centers",
              "Offices of physicians", "Residential care facilities, without nursing",
              "Offices of other health practitioners", "Employment services",
              "Administration of human resource programs")) %>%
  select(ind, label, everything())


kable(nurses, caption = "Nurses: top 10 industries by share",
      col.names = c("Industry code", "Industry label","$\\rho$", 
                    "Workers", "$\\sigma$"), escape=F) %>%
  kable_styling("striped", full_width = F) %>%
  scroll_box(width = "100%", height = "250px")
Nurses: top 10 industries by share
Industry code Industry label \(\rho\) Workers \(\sigma\)
8170 Home health care services 0.581 668969 0.119
8270 Nursing care facilities 0.559 964026 0.175
8190 Hospitals 0.373 2436309 0.435
8180 Other health care services 0.218 336128 0.060
8090 Outpatient care centers 0.177 255634 0.045
7970 Offices of physicians 0.148 239982 0.043
8290 Residential care facilities, without nursing 0.136 112796 0.020
8080 Offices of other health practitioners 0.101 31333 0.006
7580 Employment services 0.092 88611 0.016
9480 Administration of human resource programs 0.081 82194 0.015
# Store industry code and labels to plot later
main_industries <- rbind(nurses %>% select(ind,label),
                         doctors %>% select(ind,label))

rm(doctors,nurses)

3.1.2 Top industries by workers

Doctors: top 10 industries by workers
Industry code Industry label \(\rho\) Workers \(\sigma\)
8190 Hospitals 0.061 404712 0.420
7970 Offices of physicians 0.226 365990 0.390
8090 Outpatient care centers 0.040 56675 0.059
8180 Other health care services 0.025 39449 0.042
7870 Colleges and universities, including junior colleges 0.009 32158 0.034
9480 Administration of human resource programs 0.009 9668 0.010
8080 Offices of other health practitioners 0.013 3869 0.004
8070 Offices of optometrists 0.030 3655 0.004
9470 Justice, public order, and safety activities 0.001 3182 0.003
7460 Scientific research and development services 0.005 2661 0.003
Nurses: top 10 industries by workers
Industry code Industry label \(\rho\) Workers \(\sigma\)
8190 Hospitals 0.373 2436309 0.435
8270 Nursing care facilities 0.559 964026 0.175
8170 Home health care services 0.581 668969 0.119
8180 Other health care services 0.218 336128 0.060
8090 Outpatient care centers 0.177 255634 0.045
7970 Offices of physicians 0.148 239982 0.043
8290 Residential care facilities, without nursing 0.136 112796 0.020
7580 Employment services 0.092 88611 0.016
7860 Elementary and secondary schools 0.009 82682 0.015
9480 Administration of human resource programs 0.081 82194 0.015

3.1.3 Average share over time

Below I plot the time series of the average for both shares defined above, by industry.

# CPS industry codes for nurses and doctors
nurse_doctor_codes <- cps_nurse_doctor_industry %>% group_by(ind,nurse) %>% 
  summarise(rho_io = mean(rho_io) %>% round(3)) %>% arrange(nurse,desc(rho_io)) %>% # calculate average share across dates
  filter(rho_io>=0.001) # filter to shares >= 0.001
# nurse_doctor_codes$ind %>% unique %>% length

# export codes
write.csv(data.frame(cps_ind = nurse_doctor_codes$ind %>% unique),
          "../input_data/cps_nurses_doctors_ind.csv", row.names = F)
rm(nurse_doctor_codes)

4 QCEW

The QCEW data is organized as follows. There is one spreadsheet per year per industry, in each of this, there is employment data per month, state, and county. Within the state, employed workers are divided by employer (Federal Government, Private Sector, Local Government, etc.).

The CPS uses industry classification codes for detailed industry based on the 2017 North American Industry Classification System (NAICS) but more aggregate. The QCEW uses the NAICS of different years to classify industries in different years (more details here). To merge these datasets I manually aggregate the QCEW series using the CPS corresponding codes, available at this online attachment [link].

After manually matching the 129 CPS industry codes to 171 NAICS codes, for nurses and doctors, I can follow the steps below to calculate the scaling factors and benchmark the CPS with the QCEW.

  1. Download the QCEW data from 2003 to 2023, i.e., 21 years of data;
  2. For each of the 171 NAICs codes and year, sum the total number of workers by month across states and employers, i.e., 171*21=3591 data files to be processed;
  3. Aggregate the NAICs codes according to the CPS industry codes;
  4. Calculate ratio of total number of workers per industry dividing the workers per industry according to the CPS by the same number according to the QCEW.

4.1 Calculate workers per NAICs

Import the industry codes and see how many files I can match in the QCEW.

# Import industry code dictionary
ind_codes <- read.csv("../input_data/cps_qcew_ind.csv")

Aggregate workers across employers and states.

# Function to process CSV
process_csv <- function(year, industry, quiet=F) {
  
  # Import CSV and edit
  year_folder <- paste0("../qcew_data/",year,"_qtrly_by_industry/", 
                        year,ifelse(year<2023,".q1-q4",".q1-q1"),".by_industry/")
  
  # Condition on finding file for given industry
  missing <- list.files(year_folder,
           pattern = paste0(".* ", industry, " .*")) %>% is_empty
  
  if(missing==F) {
    
  # Process data 
  ind_df <- read.csv(paste0(year_folder, list.files(year_folder,
              pattern = paste0(".* ", industry, " .*")))) %>% # import csv
    filter(str_detect(area_title, "Statewide")) %>% # keep only aggregate by state
    select(year, qtr, industry_code, month1_emplvl, month2_emplvl, month3_emplvl) %>% # keep variables of interest
    group_by(qtr) %>% # group by quarter to collapse summing workers
    summarise(year = year[1], ind = industry_code[1], # keep year and industry
              month1_empl = sum(month1_emplvl), month2_empl = sum(month2_emplvl),
              month3_empl = sum(month3_emplvl)) %>% # sum number of workers
    pivot_longer(cols = starts_with("month"), # reshapes data to have it by industry-date
                 names_to = "month", values_to = "empl", 
                 names_pattern = "month(.)_empl") %>%
    mutate(date = paste(year,1:ifelse(year<2023,12,3),"1",sep="-") %>% as.Date("%Y-%m-%d")) %>% # create date variable
    select(date, ind, empl) # keep variables of interest
  
  # Save clean processed csv
  write.csv(ind_df, paste0("../qcew_data/", year, "_qtrly_by_industry/", 
                           year, "_", industry, ".csv"), row.names = F)
  } 
  else { # print the ones we are not able to match
    if(quiet==F) { 
      paste0("For year ", year," we cannot match NAICS code ",
                 industry, " to any QCEW file.") %>% print
    }
  }
}

This part of the code is commented out because it is inefficient to run it every time. The csv files were processed already and the next code chunk uses them directly, skipping this part.

# # Apply function for all industries and years
# for (y in 2003:2023) {
#   for (i in 1:length(ind_codes$naics)) {
#     process_csv(y, ind_codes$naics[i], quiet=T)
#   }
# }

For every year, there are 5 NAICs codes that I cannot match with the QCEW files but they are not healthcare related and have low share of nurses and doctors (less than or equal to 0.003).

Lastly, I append all the separate QCEW year-industry datasets created above.

# First dataset to start appending all others
qcew <- read.csv(paste0("../qcew_data/2003_qtrly_by_industry/",
                        list.files("../qcew_data/2003_qtrly_by_industry/")[2]))

# Append every industry for every year
for(y in 2003:2023) { # loop starts at third file for 2003 because second is above, first is dir with raw data
  for(i in ifelse(y>2003,2,3):length(list.files(paste0("../qcew_data/",y,"_qtrly_by_industry/")))) {
    qcew <- qcew %>% 
      rbind(read.csv(paste0("../qcew_data/",y,"_qtrly_by_industry/",
                      list.files(paste0("../qcew_data/",y,"_qtrly_by_industry/"))[i])))
  }
}

# Sort and rename
qcew <- qcew %>% arrange(ind, date) %>% rename(naics=ind)

4.2 Calculate workers per CPS industry

Merge QCEW data with industry code dictionary and sum workers across NAICs codes for a given CPS industry category.

# Merge with CPS codes and sum by industry-date
qcew <- left_join(qcew, ind_codes %>% select(cps_ind,naics), by = "naics") %>%
  group_by(cps_ind, date) %>% summarise(lq_i = sum(empl)) %>% # sum workers by CPS industry code
  rename(ind = cps_ind) %>% mutate(date = as.Date(date))

5 CPS vs. QCEW

Given all the procedures above, now I can merge the CPS with the QCEW to compare them and benchmark the former with the latter.

# Merge
cps_qcew <- inner_join(cps_nurse_doctor_industry, # merging keeping only matched observations!
                      qcew, by = c("ind", "date")) %>%
  mutate(xi = lc_i/lq_i) %>% # calculate scaling factor
  mutate(xi_w1 = winsorize(xi,.01), xi_w5 = winsorize(xi,.05), 
         xi_w10 = winsorize(xi,.1)) %>% # winsorized scaling factor
  mutate(lc_io_hat = xi*lc_io, lc_io_hat_w1 = xi_w1*lc_io,
         lc_io_hat_w5 = xi_w5*lc_io, lc_io_hat_w10 = xi_w10*lc_io,# calculate scaled l^cps_io
         lq_io_hat = rho_io*lq_i) # calculate share-weighted QCEW

# Collapse
nurses_doctors_collapsed <- cps_qcew %>% group_by(nurse, date) %>% 
  summarise(lc_o = sum(lc_io), lc_o_hat = sum(lc_io_hat), 
            lc_o_hat_w1 = sum(lc_io_hat_w1), lc_o_hat_w5 = sum(lc_io_hat_w5),
            lc_o_hat_w10 = sum(lc_io_hat_w10), lq_o_hat = sum(lq_io_hat))

6 Scaling factor, \(\xi_{it}\)

Below we can see the summary statistics for \(\xi_{it}\). There are a lot of zeros and small values, with a median of 1.03 and 75th percentile of 1.46. However, there are also some very high values with a maximum of 2092.94, bringing the average to 7.57. Given these extreme values, I also present summary statistics and histograms for the variable winsorized at different percentiles.

The weighted average uses the number of doctors employed in a given industry in the first month of the sample period (January 2003). The idea is to place more weight on industries that have more doctors, avoid using time-varying weights and avoid having the weights related to the variables that go into measuring \(\xi_{it}\) itself.

Note: I am winsorizing across the whole distribution, not within industry. I am not completely sure this is the right approach.

# Create winsorized versions of xi
cps_qcew <- cps_qcew %>% 
  mutate(xi_w1 = winsorize(xi,.01), 
         xi_w5 = winsorize(xi,.05),
         xi_w10 = winsorize(xi,.1))
Summary statistics for scaling factor
Variable Min Q1 Median Mean Q3 Max Weighted mean SD
\(\xi_{it}\) 0 0.618 1.04 7.698 1.473 2092.94 1.087 76.826
\(\xi_{it}\) winsorized at 1/99% 0 0.618 1.04 1.284 1.473 10.449 1.081 1.428
\(\xi_{it}\) winsorized at 5/95% 0.037 0.618 1.04 1.104 1.473 2.693 1.046 0.658
\(\xi_{it}\) winsorized at 10/90% 0.335 0.618 1.04 1.077 1.473 2.003 1.019 0.53

The histograms below show the distribution for \(\xi_{it}\) without any weighting.

If we look at the average \(\xi_{it}\) overtime, we can see that the higher values are in the early 2000s, but there are some spikes across the whole sample period. The average is notably low for more recent periods. The median, however, is close to one for the whole sample period. If we weight this average by the number of doctors in each industry in January 2003, however, there is still a downward trend but the values are closer to one for the whole sample period.

7 Missing workers

7.1 By Occupation

Below we can see a comparison of the raw CPS with the CPS benchmarked by the QCEW for nurses and doctors, i.e., we can see \(l^c_{ot}\) and \(\hat{l}^c_{ot}\) for these occupations. I also plot the linear time trend that we would use to calculate the number of missing workers. There are no missing nurses using the benchmarked CPS without winsorization, only when we winsorize the scaling factor at the 5/95th and 10/90th percentiles. There are missing doctors with and without winsorization but there is also a very odd data point before 2020, which may be misleading and affecting the slope of the time trend.

# Figures by occupation
df_plot <- nurses_doctors_collapsed %>%
  pivot_longer(cols = c("lc_o_hat", "lc_o_hat_w1", "lc_o_hat_w5", # reshapes
                        "lc_o_hat_w10", "lc_o", "lq_o_hat"), 
              names_to = "type", values_to = "l_o") 

# Calculate linear trend
calc_exp_values <- function(nurse_reg, var) {
  # Run linear regression from Jan2013 to Jan2020
  regression <- 
    lm(l_o ~ date, 
       data = df_plot %>% 
         filter(date %within% interval(ymd("2013-01-01"),ymd("2020-01-01")) & 
                              nurse==nurse_reg & type==var))
  # Calculate expected average for last 6 months
  exp_values <- regression$coefficients["(Intercept)"] + 
    regression$coefficients["date"]*as.numeric(df_plot$date %>% unique)
  # Add to dataset
  df_plot$l_o_reg[df_plot$nurse==nurse_reg & df_plot$type==var] <<- exp_values
}
for(i in 0:1) {
  for(v in c("lc_o",paste0("lc_o_hat",c("","_w1","_w5","_w10")),"lq_o_hat")) {
    calc_exp_values(i,v)
  }
}

7.2 By Industry

Here, I plot the CPS, the QCEW, the benchmarked CPS, and the CES for the main industry codes of interest, i.e., the ones with higher share of nurses and doctors. I calculate the benchmarked CPS at the industry level as \(\hat{l}^c_{it}=\frac{1}{\xi_{it}} l^q_{it}\). This also work as a sanity check to confirm that \(\hat{l}^c_{it}=\hat{l}^q_{it}\), which should be true by design. The QCEW seems to be smoother than the CPS, as expected. I did not add the CES to all industries, but I can do it later if necessary.

For some industries, the values are notably different, namely, offices of physicians, outpatient care centers, and other health care services.

# Read CES data
ces <- read_csv("../../CES/data/ces_selected_industries.csv") 

# Merge with CPS and QCEW
cps_qcew <- cps_qcew %>%
  left_join(ces, by = c("ind","date"))
# Figures by occupation
df_plot <- cps_qcew %>% mutate(lc_i_hat = lc_i/xi) %>% filter(nurse==1) %>% # by industry, so keep only one occ
  select(ind, date, lc_i, lc_i_hat, lq_i, le_i) %>%
  pivot_longer(cols = c("lc_i", "lc_i_hat", "lq_i", "le_i"), # reshapes
              names_to = "type", values_to = "l_i") 

# Calculate linear trend
calc_exp_values <- function(industry, var) {
  # Run linear regression from Jan2013 to Jan2020
  df_reg <- df_plot %>% # filter for interval, var, and industry of interest
         filter(date %within% interval(ymd("2013-01-01"),ymd("2020-01-01")) & 
                  type==var & ind==industry)
  df_plot$l_i_reg <- NA
  if(all(is.na(df_reg$l_i))==F) {
    regression <- lm(l_i ~ date, data = df_reg)
    # Calculate expected average for last 6 months
    exp_values <- regression$coefficients["(Intercept)"] + 
      regression$coefficients["date"]*as.numeric(df_plot$date %>% unique)
    # Add to dataset
    df_plot$l_i_reg[df_plot$type==var & df_plot$ind==industry] <<- exp_values
  }
}
for(i in main_industries$ind) {
  for(v in df_plot$type %>% unique) {
    calc_exp_values(i,v)
  }
}

7.3 Different extrapolations

Here, I plot missing workers by occupation using only the CPS and extrapolating the pre-pandemic trends with regressions starting on 2013, 2015, and 2017.

df_plot <- nurses_doctors_collapsed %>% select(nurse, date, lc_o)

# Calculate linear trend
calc_exp_values <- function(nurse_reg, start_year) {
  # Run linear regression from Jan2013 to Jan2020
  regression <- 
    lm(lc_o ~ date, 
       data = df_plot %>% 
         filter(date %within%
                  interval(ymd(paste0(start_year,"-01-01")),ymd("2020-01-01")) & 
                              nurse==nurse_reg))
  # Calculate expected average for last 6 months
  exp_values <- regression$coefficients["(Intercept)"] + 
    regression$coefficients["date"]*as.numeric(df_plot$date %>% unique)
  # Add to dataset
  assign(paste0("df_plot$lc_o_reg_",start_year-2000,"[df_plot$nurse==nurse_reg]"),
         exp_values)
  if(start_year==2013) {
    df_plot$lc_o_reg_13[df_plot$nurse==nurse_reg] <<- exp_values
  }
  if(start_year==2015) {
    df_plot$lc_o_reg_15[df_plot$nurse==nurse_reg] <<- exp_values
  }
  if(start_year==2017) {
    df_plot$lc_o_reg_17[df_plot$nurse==nurse_reg] <<- exp_values
  }
}
for(i in 0:1) {
  for(y in c(2013,2015,2017)) {
    calc_exp_values(i,y)
  }
}

7.4 Rolling averages

Figures with 3- and 6-month rolling averages of the CPS are presented below. The first two figures are for the 3-month rolling average.

for(k in c(3,6)) {
  # Calculate rolling average
  df_plot <- nurses_doctors_collapsed %>% select(nurse, date, lc_o) %>%
    mutate(lc_o = rollmean(lc_o, k, fill=NA, align='right'))
  # Calculate linear extrapolations using rolling average
  for(i in 0:1) {
    for(y in c(2013,2015,2017)) {
      calc_exp_values(i,y)
    }
  }
  # Plots
  plot_missing_workers(0) %>% plot
  plot_missing_workers(1) %>% plot
}