1 Summary

In this report, I calculate and plot employment trends for doctors and nurses by unionization statys using data from the Current Population Survey (CPS) from January 2003 to September 2024, the latest data available as of October 2024.

The variable of interest in this report is UNION, which indicates whether, for the current job, the respondent was: 1) a member of a labor union or employee association similar to a union; 2) not a union member but covered by a union or employee association contract; or 3) neither a union member nor covered by a union contract.

For this analysis, I merge together members of a labor union as well as those who are not members but are covered by a union. I also use the EARNWT weight, as recommended by CPS documentation.

These are the following steps I took for this task:

  1. I downloaded basic monthly cross-sectional data from January 2003 to August 2024 for outgoing rotation groups 4 and 8 only. Households in the CPS are interviewed for four consecutive months, then have an eight-month break, followed by another four months of interviews. In the fourth and eighth month of their interview cycle (when they are about to enter the eight-month break or exit the survey permanently), households are asked additional labor-related questions, including those related to unionization. The specific variables of interest for this analysis are the following: YEAR, MONTH, EMPSTAT, OCC2010, UNION, and EARNWT.
  2. With all samples downloaded, I created a subset of the data for only respondents that indicated they were actively working, or had a job but had not been at work for the last week relative to the survey date. I also exclude cases marked with the ASEC flag. The ASEC includes all respondents from the March Basic Monthly Survey, along with additional oversamples from other months (refer to more information on oversamples), thus it doublecounts respondents.
  3. Next, I created flags for nurses and doctors.
  • Nurses were flagged if OCC2010 was 3500 - Licensed Practical and Licensed Vocational Nurses, 3600 - Nursing, Psychiatric, and Home Health Aides, or 3130 - Registered Nurses.
  • Doctors were flagged if OCC2010 was 3060 - Physicians and Surgeons.
  1. Using the EARNWT weight, I calculated the monthly count and monthly log count for each month-year-unionization observation
  2. I then calculated the linear time trend during COVID by running a simple linear regression on log unionization for nurses and doctors separately between January 2013 to January 2020.
  3. Finally, I plot the missing workers by occupation and extrapolate the pre-pandemic trends from 2020 onwards.

2 Code

# adding union labels - note: for the purposes of this exercise, i am merging code 2 - "Member of labor union" and code 3 - "Covered by union but not a member" into a single code - "Union coverage". 


data <- data %>% 
  mutate(UNION = case_when(
    UNION == 1 ~ "No union coverage",
    UNION == 2 ~ "Union coverage",
    UNION == 3 ~ "Union coverage",
    TRUE ~ NA_character_
  )) %>% 
  mutate(date = paste(YEAR, MONTH, "1",sep="-") %>% as.Date("%Y-%m-%d")) # formats date


# now creating nurse and doctor flags and filtering only for nurses and doctors
hc.df <- data %>% 
  filter(ASECFLAG == 2 | is.na(ASECFLAG) == TRUE) %>% 
  filter(EMPSTAT %in% c(10,12)) %>% # "At work" or "Has job, not at work last week"
  mutate(nurse = ifelse(OCC2010 %in% c(3500,3600,3130),1,0), # create nurse dummy
         doctor = ifelse(OCC2010==3060,1,0)) %>% # create doctor dummy
  filter(nurse==1 | doctor==1) # keep only doctors and nurses


# unionized workers per occupation-industry (# of unionized doctors/nurses in each industry)
# i use EARNWT, which is per CPS documentation "a person-level weight that should be used in any analysis including one of the following variables: EARNWEEK, HOURWAGE, PAIDHOUR, UNION, UHRSWORKORG, WKSWORKORG, ELIGORG, and OTPAY. For any other analysis using ASEC data, researchers should use ASECWT or for analyses of non-ASEC data, WTFINL."

# union <- hc.df %>% 
#   filter(is.na(UNION) == FALSE) %>% 
#   group_by(date, nurse, UNION) %>% 
#   summarise(count = sum(EARNWT, na.rm = T)) %>%
#   mutate(log_count = log(count)) %>%
#   arrange(date,UNION)

# creating df for nurse unionization only
nurses <- hc.df %>% 
  filter(is.na(UNION) == FALSE & nurse == 1) %>% 
  group_by(date, UNION) %>% 
  summarise(count = sum(EARNWT, na.rm = T)) %>%
  mutate(log_count = log(count),
         year = as.numeric(format(date, "%Y")))
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
# creating df for doctor unionization only
doctors <- hc.df %>% 
  filter(is.na(UNION) == FALSE & nurse == 0) %>% 
  group_by(date, UNION) %>% 
  summarise(count = sum(EARNWT, na.rm = T)) %>%
  mutate(log_count = log(count),
         year = as.numeric(format(date, "%Y")))
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
# creating regressions for doctors by unionization status

# doctors with union coverage
d.cov <- doctors %>% 
  filter(UNION == "Union coverage")

# log-linear model for log-transformed count
dreg.cov <- lm(log_count ~ date, 
       data = d.cov %>% 
         filter(between(date, as.Date('2013-01-01'), as.Date('2020-01-01'))))


# predicting log-transformed values and exponentiate to original scale
d.cov$log_exp_values <- predict(dreg.cov, newdata = d.cov)
d.cov$exp_values <- exp(d.cov$log_exp_values)              

# doctors with no union coverage

d.nocov <- doctors %>% 
  filter(UNION == "No union coverage")

# log-linear model for log-transformed count
dreg.nocov <- lm(log_count ~ date, 
       data = d.nocov %>% 
         filter(between(date, as.Date('2013-01-01'), as.Date('2020-01-01'))))

# predicting log-transformed values and exponentiate to original scale
d.nocov$log_exp_values <- predict(dreg.nocov, newdata = d.nocov)  
d.nocov$exp_values <- exp(d.nocov$log_exp_values)                 

# plotting 

d.plot <- ggplot(doctors, aes(x = date, y = log_count, color = UNION)) +
  
   geom_line(aes(y=log_exp_values, x=date), color="darkblue", linetype="dotted", linewidth=.8,
              data=d.cov %>% filter(date>=ymd("2013-01-01"))) + 
   
   geom_line(aes(y=log_exp_values, x=date), color="red", linetype="dotted", linewidth=.8,
              data=d.nocov %>% filter(date>=ymd("2013-01-01"))) + 
  
  geom_line(linewidth = 0.5) +
  labs(title = "Log Employment of Doctors by Unionization Status (Jan. 2003 to Sept. 2024)",
       x = "",
       y = "Log Count") +
  
  scale_color_manual(values = c("#9b494f","#31597d"), 
                       c(name="")) +
  
  geom_vline(xintercept = 2020-01-01, linetype = "dashed", color = "grey", linewidth = 0.5) +
  theme(legend.position="bottom") +
  theme_stata()
# creating table to show missing workers, total workers, predicted total workers, and percent missing

# union covered
d1.table <- d.cov %>% 
  filter(between(date, as.Date("2024-04-01"), as.Date("2024-09-01"))) %>% 
  group_by(UNION) %>% 
  summarise(
    "Total Workers" = mean(count, na.rm = TRUE),
    "Predicted Total Workers" = mean(exp_values, na.rm = TRUE),
    "Missing Workers" = `Predicted Total Workers` - `Total Workers`,
    "Percent Missing" = (`Missing Workers` / `Predicted Total Workers`) * 100
  ) %>% 
  rename("Union Status" = UNION) %>%
  select("Union Status", "Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing") 

# not covered by union
d2.table <- d.nocov %>% 
  filter(between(date, as.Date("2024-04-01"), as.Date("2024-09-01"))) %>% 
  group_by(UNION) %>% 
  summarise(
    "Total Workers" = mean(count, na.rm = TRUE),
    "Predicted Total Workers" = mean(exp_values, na.rm = TRUE),
    "Missing Workers" = `Predicted Total Workers` - `Total Workers`,
    "Percent Missing" = (`Missing Workers` / `Predicted Total Workers`) * 100
  ) %>% 
  rename("Union Status" = UNION) %>%
  select("Union Status", "Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing") 



# Calculate Missing Workers and Percent Missing first
d.table <- bind_rows(d1.table, d2.table) %>%
  mutate(
    `Missing Workers` = `Predicted Total Workers` - `Total Workers`,
    `Percent Missing` = round((`Missing Workers` / `Predicted Total Workers`) * 100, 2))

# Add the total row at the end with calculated sums
d.table <- d.table %>%
  add_row(
    `Union Status` = "Total",
    `Missing Workers` = sum(d.table$`Missing Workers`, na.rm = TRUE),
    `Total Workers` = sum(d.table$`Total Workers`, na.rm = TRUE),
    `Predicted Total Workers` = sum(d.table$`Predicted Total Workers`, na.rm = TRUE),
    `Percent Missing` = round((`Missing Workers` / `Predicted Total Workers`) * 100, 2))



d.table <- d.table %>% 
  kable(
    caption = "Table 1: Missing Doctors by Unionization Status",
    col.names = c("Industry", "Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing"),
    format = "html", 
    digits = c(0, 0, 0, 0, 2),
      format.args = list(big.mark = ",")) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  ) %>%
  add_header_above(c(" " = 1, "(1)" = 1, "(2)" = 1, "(3)" = 1, "(4)" = 1)) %>%
  footnote(
    general = "The table above reports the number of missing doctors by unionization status. The linear time trend was taken from January 2013 to January 2020 to calculate the number of missing workers. The numbers presented are an average of the last 6 months (April 2024 - September 2024).")
doctors.comb <- doctors %>%
  group_by(date) %>%
  summarise(count = sum(count)) %>% 
  mutate(log_count = log(count))


# log-linear model for log-transformed count
dreg.comb <- lm(log_count ~ date, data = doctors.comb %>% 
                filter(between(date, as.Date('2013-01-01'), as.Date('2020-01-01'))))

# predicting log-transformed values and exponentiate to original scale
doctors.comb$log_exp_values <- predict(dreg.comb, newdata = doctors.comb)
doctors.comb$exp_log_values <- exp(doctors.comb$log_exp_values) 



# plotting 

d.comb.plot <- ggplot(doctors.comb, aes(x = date, y = log_count)) +
   

   geom_line(aes(y=log_exp_values, x=date), color="darkblue", linetype="dotted", linewidth=.8,
              data=doctors.comb %>% filter(date>=ymd("2013-01-01"))) + 
   
  geom_line(linewidth = 0.5) +
   
  labs(title = "Log Employment of Doctors (Jan. 2003 to Sept. 2024)",
       x = "",
       y = "Log Count") +
  geom_vline(xintercept = as.Date("2020-01-01"), linetype = "dashed", color = "grey", linewidth = 0.5) +
  theme(legend.position="bottom") +
  theme_stata()
# creating table to show missing workers, total workers, predicted total workers, and percent missing

d3.table <- doctors.comb %>% 
  filter(between(date, as.Date("2024-04-01"), as.Date("2024-09-01"))) %>% 
  summarise(
    "Total Workers" = mean(count, na.rm = TRUE),
    "Predicted Total Workers" = mean(exp_log_values, na.rm = TRUE),
    "Missing Workers" = `Predicted Total Workers` - `Total Workers`,
    "Percent Missing" = (`Missing Workers` / `Predicted Total Workers`) * 100
  ) %>% 
  select("Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing") %>% 
  mutate(`Percent Missing` = round(`Percent Missing`, 2),
         " " = "Nurses") %>% 
  select(" ", "Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing") %>% 
  kable(
    caption = "Table 3: Missing Doctors - Past 6 Months Average",
    format = "html", 
    digits = c(0, 0, 0, 0, 2),
      format.args = list(big.mark = ",")) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  ) %>%
  add_header_above(c(" " = 1, "(1)" = 1, "(2)" = 1, "(3)" = 1, "(4)" = 1)) %>%
  footnote(
    general = "The table above reports the number of missing doctors. The linear time trend was taken from January 2013 to January 2020 to calculate the number of missing workers. The numbers presented are an average of the last 6 months (April 2024 - September 2024).")
# creating regressions for nurses by unionization status

# nurses with union coverage

n.cov <- nurses %>% 
  filter(UNION == "Union coverage")

# log-linear model for log-transformed count
nreg.cov <- lm(log_count ~ date, 
       data = n.cov %>% 
         filter(between(date, as.Date('2013-01-01'), as.Date('2020-01-01'))))

# predicting log-transformed values and exponentiate to original scale
n.cov$log_exp_values <- predict(nreg.cov, newdata = n.cov)
n.cov$exp_values <- exp(n.cov$log_exp_values) 




# nurses with no union coverage

n.nocov <- nurses %>% 
  filter(UNION == "No union coverage")

nreg.nocov <- lm(log_count ~ date, 
       data = n.nocov %>% 
         filter(between(date, as.Date('2013-01-01'), as.Date('2020-01-01'))))

# predicting log-transformed values and exponentiate to original scale
n.nocov$log_exp_values <- predict(nreg.nocov, newdata = n.nocov)  
n.nocov$exp_values <- exp(n.nocov$log_exp_values)               


# plotting 

n.plot <- ggplot(nurses, aes(x = date, y = log_count, color = UNION)) +
   

   geom_line(aes(y=log_exp_values, x=date), color="darkblue", linetype="dotted", linewidth=.8,
              data=n.cov %>% filter(date>=ymd("2013-01-01"))) + 
   
   
   geom_line(aes(y=log_exp_values, x=date), color="red", linetype="dotted", linewidth=.8,
              data=n.nocov %>% filter(date>=ymd("2013-01-01"))) + 
  
  geom_line(linewidth = 0.5) +
   
  labs(title = "Log Employment of Nurses by Unionization Status (Jan. 2003 to Sept. 2024)",
       x = "",
       y = "Log Count") +
  scale_color_manual(values = c("#9b494f","#31597d"), 
                       c(name="")) +
  geom_vline(xintercept = as.Date("2020-01-01"), linetype = "dashed", color = "grey", linewidth = 0.5) +
  theme(legend.position="bottom") +
  theme_stata()
# creating table to show missing workers, total workers, predicted total workers, and percent missing

n1.table <- n.cov %>% 
  filter(between(date, as.Date("2024-04-01"), as.Date("2024-09-01"))) %>% 
  group_by(UNION) %>% 
  summarise(
    "Total Workers" = mean(count, na.rm = TRUE),
    "Predicted Total Workers" = mean(exp_values, na.rm = TRUE),
    "Missing Workers" = `Predicted Total Workers` - `Total Workers`,
    "Percent Missing" = (`Missing Workers` / `Predicted Total Workers`) * 100
  ) %>% 
  rename("Union Status" = UNION) %>%
  select("Union Status", "Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing") %>% 
  mutate(`Percent Missing` = round(`Percent Missing`, 2))


n2.table <- n.nocov %>% 
  filter(between(date, as.Date("2024-04-01"), as.Date("2024-09-01"))) %>% 
  group_by(UNION) %>% 
  summarise(
    "Total Workers" = mean(count, na.rm = TRUE),
    "Predicted Total Workers" = mean(exp_values, na.rm = TRUE),
    "Missing Workers" = `Predicted Total Workers` - `Total Workers`,
    "Percent Missing" = (`Missing Workers` / `Predicted Total Workers`) * 100
  ) %>% 
  rename("Union Status" = UNION) %>%
  select("Union Status", "Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing") %>% 
  mutate(`Percent Missing` = round(`Percent Missing`, 2))




# Calculate Missing Workers and Percent Missing first
n.table <- bind_rows(n1.table, n2.table) %>%
  mutate(
    `Missing Workers` = `Predicted Total Workers` - `Total Workers`,
    `Percent Missing` = round((`Missing Workers` / `Predicted Total Workers`) * 100, 2)
  )

# Add the total row at the end with calculated sums
n.table <- n.table %>%
  add_row(
    `Union Status` = "Total",
    `Missing Workers` = sum(n.table$`Missing Workers`, na.rm = TRUE),
    `Total Workers` = sum(n.table$`Total Workers`, na.rm = TRUE),
    `Predicted Total Workers` = sum(n.table$`Predicted Total Workers`, na.rm = TRUE),
    `Percent Missing` = round((`Missing Workers` / `Predicted Total Workers`) * 100, 2))


n.table <- n.table %>% 
  kable(
    caption = "Table 2: Missing Nurses by Unionization Status - 6 Month Averages",
    col.names = c("Industry", "Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing"),
    format = "html", 
    digits = c(0, 0, 0, 0, 2),
      format.args = list(big.mark = ",")) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  ) %>%
  add_header_above(c(" " = 1, "(1)" = 1, "(2)" = 1, "(3)" = 1, "(4)" = 1)) %>%
  footnote(
    general = "The table above reports the number of missing nurses by unionization status. The linear time trend was taken from January 2013 to January 2020 to calculate the number of missing workers. The numbers presented are an average of the last 6 months (April 2024 - September 2024).")
nurses.comb <- nurses %>%
  group_by(date) %>%
  summarise(count = sum(count)) %>% 
  mutate(log_count = log(count))


# log-linear model for log-transformed count
nreg.comb <- lm(log_count ~ date, data = nurses.comb %>% 
                filter(between(date, as.Date('2013-01-01'), as.Date('2020-01-01'))))

# predicting log-transformed values and exponentiate to original scale
nurses.comb$log_exp_values <- predict(nreg.comb, newdata = nurses.comb)
nurses.comb$exp_log_values <- exp(nurses.comb$log_exp_values) 



# plotting 

n.comb.plot <- ggplot(nurses.comb, aes(x = date, y = log_count)) +
   

   geom_line(aes(y=log_exp_values, x=date), color="darkblue", linetype="dotted", linewidth=.8,
              data=nurses.comb %>% filter(date>=ymd("2013-01-01"))) + 
   
  geom_line(linewidth = 0.5) +
   
  labs(title = "Log Employment of Nurses (Jan. 2003 to Sept. 2024)",
       x = "",
       y = "Log Count") +
  geom_vline(xintercept = as.Date("2020-01-01"), linetype = "dashed", color = "grey", linewidth = 0.5) +
  theme(legend.position="bottom") +
  theme_stata()
# creating table to show missing workers, total workers, predicted total workers, and percent missing

n3.table <- nurses.comb %>% 
  filter(between(date, as.Date("2024-04-01"), as.Date("2024-09-01"))) %>% 
  summarise(
    "Total Workers" = mean(count, na.rm = TRUE),
    "Predicted Total Workers" = mean(exp_log_values, na.rm = TRUE),
    "Missing Workers" = `Predicted Total Workers` - `Total Workers`,
    "Percent Missing" = (`Missing Workers` / `Predicted Total Workers`) * 100
  ) %>% 
  select("Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing") %>% 
  mutate(`Percent Missing` = round(`Percent Missing`, 2),
         " " = "Nurses") %>% 
  select(" ", "Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing") %>% 
  kable(
    caption = "Table 3: Missing Nurses - Past 6 Months Average",
    format = "html", 
    digits = c(0, 0, 0, 0, 2),
      format.args = list(big.mark = ",")) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  ) %>%
  add_header_above(c(" " = 1, "(1)" = 1, "(2)" = 1, "(3)" = 1, "(4)" = 1)) %>%
  footnote(
    general = "The table above reports the number of missing nurses. The linear time trend was taken from January 2013 to January 2020 to calculate the number of missing workers. The numbers presented are an average of the last 6 months (April 2024 - September 2024).")

3 Figures and Tables

3.1 Doctors

Table 1: Missing Doctors by Unionization Status
(1)
(2)
(3)
(4)
Industry Missing Workers Total Workers Predicted Total Workers Percent Missing
Union coverage 93,223 62,152 155,375 60.00
No union coverage 175,945 803,184 979,129 17.97
Total 269,168 865,337 1,134,504 23.73
Note:
The table above reports the number of missing doctors by unionization status. The linear time trend was taken from January 2013 to January 2020 to calculate the number of missing workers. The numbers presented are an average of the last 6 months (April 2024 - September 2024).
Table 3: Missing Doctors - Past 6 Months Average
(1)
(2)
(3)
(4)
Missing Workers Total Workers Predicted Total Workers Percent Missing
Nurses 272,186 865,337 1,137,523 23.93
Note:
The table above reports the number of missing doctors. The linear time trend was taken from January 2013 to January 2020 to calculate the number of missing workers. The numbers presented are an average of the last 6 months (April 2024 - September 2024).

3.2 Nurses

Table 2: Missing Nurses by Unionization Status - 6 Month Averages
(1)
(2)
(3)
(4)
Industry Missing Workers Total Workers Predicted Total Workers Percent Missing
Union coverage -151,205 1,010,531 859,326 -17.60
No union coverage 228,778 5,473,760 5,702,538 4.01
Total 77,574 6,484,291 6,561,864 1.18
Note:
The table above reports the number of missing nurses by unionization status. The linear time trend was taken from January 2013 to January 2020 to calculate the number of missing workers. The numbers presented are an average of the last 6 months (April 2024 - September 2024).
Table 3: Missing Nurses - Past 6 Months Average
(1)
(2)
(3)
(4)
Missing Workers Total Workers Predicted Total Workers Percent Missing
Nurses 78,924 6,484,291 6,563,215 1.2
Note:
The table above reports the number of missing nurses. The linear time trend was taken from January 2013 to January 2020 to calculate the number of missing workers. The numbers presented are an average of the last 6 months (April 2024 - September 2024).