1 Summary

In this report, I calculate and plot employment trends for doctors and nurses by unionization statys using data from the Current Population Survey (CPS) from January 2003 to August 2024, the latest data available as of October 2024.

The variable of interest in this report is UNION, which indicates whether, for the current job, the respondent was: 1) a member of a labor union or employee association similar to a union; 2) not a union member but covered by a union or employee association contract; or 3) neither a union member nor covered by a union contract.

For this analysis, I merge together members of a labor union as well as those who are not members but are covered by a union. I also use the EARNWT weight, as recommended by CPS documentation.

These are the following steps I took for this task:

  1. I downloaded basic monthly cross-sectional data from January 2003 to August 2024 for outgoing rotation groups 4 and 8 only. Households in the CPS are interviewed for four consecutive months, then have an eight-month break, followed by another four months of interviews. In the fourth and eighth month of their interview cycle (when they are about to enter the eight-month break or exit the survey permanently), households are asked additional labor-related questions, including those related to unionization. The specific variables of interest for this analysis are the following: YEAR, MONTH, EMPSTAT, OCC2010, UNION, and EARNWT.
  2. With all samples downloaded, I created a subset of the data for only respondents that indicated they were actively working, or had a job but had not been at work for the last week relative to the survey date. I also exclude cases marked with the ASEC flag. The ASEC includes all respondents from the March Basic Monthly Survey, along with additional oversamples from other months (refer to more information on oversamples), thus it doublecounts respondents.
  3. Next, I created flags for nurses and doctors.
  • Nurses were flagged if OCC2010 was 3500 - Licensed Practical and Licensed Vocational Nurses, 3600 - Nursing, Psychiatric, and Home Health Aides, or 3130 - Registered Nurses.
  • Doctors were flagged if OCC2010 was 3060 - Physicians and Surgeons.
  1. Using the EARNWT weight, I calculated the monthly count and monthly log count for each month-year-unionization observation
  2. I then calculated the linear time trend during COVID by running a simple linear regression on log unionization for nurses and doctors separately between January 2013 to January 2020.
  3. Finally, I plot the missing workers by occupation and extrapolate the pre-pandemic trends from 2020 onwards.

2 Code

# adding union labels - note: for the purposes of this exercise, i am merging code 2 - "Member of labor union" and code 3 - "Covered by union but not a member" into a single code - "Union coverage". 


data <- data %>% 
  mutate(UNION = case_when(
    UNION == 1 ~ "No union coverage",
    UNION == 2 ~ "Union coverage",
    UNION == 3 ~ "Union coverage",
    TRUE ~ NA_character_
  )) %>% 
  mutate(date = paste(YEAR, MONTH, "1",sep="-") %>% as.Date("%Y-%m-%d")) # formats date


# now creating nurse and doctor flags and filtering only for nurses and doctors
hc.df <- data %>% 
  filter(ASECFLAG == 2 | is.na(ASECFLAG) == TRUE) %>% 
  filter(EMPSTAT %in% c(10,12)) %>% # "At work" or "Has job, not at work last week"
  mutate(nurse = ifelse(OCC2010 %in% c(3500,3600,3130),1,0), # create nurse dummy
         doctor = ifelse(OCC2010==3060,1,0)) %>% # create doctor dummy
  filter(nurse==1 | doctor==1) # keep only doctors and nurses


# unionized workers per occupation-industry (# of unionized doctors/nurses in each industry)
# i use EARNWT, which is per CPS documentation "a person-level weight that should be used in any analysis including one of the following variables: EARNWEEK, HOURWAGE, PAIDHOUR, UNION, UHRSWORKORG, WKSWORKORG, ELIGORG, and OTPAY. For any other analysis using ASEC data, researchers should use ASECWT or for analyses of non-ASEC data, WTFINL."

# union <- hc.df %>% 
#   filter(is.na(UNION) == FALSE) %>% 
#   group_by(date, nurse, UNION) %>% 
#   summarise(count = sum(EARNWT, na.rm = T)) %>%
#   mutate(log_count = log(count)) %>%
#   arrange(date,UNION)

# creating df for nurse unionization only
nurses <- hc.df %>% 
  filter(is.na(UNION) == FALSE & nurse == 1) %>% 
  group_by(date, UNION) %>% 
  summarise(count = sum(EARNWT, na.rm = T)) %>%
  mutate(log_count = log(count)) %>%
  arrange(date,UNION)
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
# creating df for doctor unionization only
doctors <- hc.df %>% 
  filter(is.na(UNION) == FALSE & nurse == 0) %>% 
  group_by(date, UNION) %>% 
  summarise(count = sum(EARNWT, na.rm = T)) %>%
  mutate(log_count = log(count)) %>%
  arrange(date,UNION)
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
# creating regressions for doctors by unionization status

# doctors with union coverage
d.cov <- doctors %>% 
  filter(UNION == "Union coverage")

dreg.cov <- lm(log_count ~ date, 
       data = d.cov %>% 
         filter(between(date, as.Date('2013-01-01'), as.Date('2020-01-01'))))


dreg2.cov <- lm(count ~ date, 
       data = d.cov %>% 
         filter(between(date, as.Date('2013-01-01'), as.Date('2020-01-01'))))


d.cov$log_exp_values <- dreg.cov$coefficients["(Intercept)"] + 
    dreg.cov$coefficients["date"]*as.numeric(d.cov$date)

d.cov$exp_values <- dreg2.cov$coefficients["(Intercept)"] + 
    dreg2.cov$coefficients["date"]*as.numeric(d.cov$date)

# doctors with no union coverage

d.nocov <- doctors %>% 
  filter(UNION == "No union coverage")

dreg.nocov <- lm(log_count ~ date, 
       data = d.nocov %>% 
         filter(between(date, as.Date('2013-01-01'), as.Date('2020-01-01'))))

dreg2.nocov <- lm(count ~ date, 
       data = d.nocov %>% 
         filter(between(date, as.Date('2013-01-01'), as.Date('2020-01-01'))))


d.nocov$log_exp_values <- dreg.nocov$coefficients["(Intercept)"] + 
    dreg.nocov$coefficients["date"]*as.numeric(d.nocov$date)
  
d.nocov$exp_values <- dreg2.nocov$coefficients["(Intercept)"] + 
    dreg2.nocov$coefficients["date"]*as.numeric(d.nocov$date)

# plotting 

d.plot <- ggplot(doctors, aes(x = date, y = log_count, color = UNION)) +
  
   geom_line(aes(y=log_exp_values, x=date), color="darkblue", linetype="dotted", linewidth=.8,
              data=d.cov %>% filter(date>=ymd("2013-01-01"))) + 
   
   geom_line(aes(y=log_exp_values, x=date), color="red", linetype="dotted", linewidth=.8,
              data=d.nocov %>% filter(date>=ymd("2013-01-01"))) + 
  
  geom_line(size = 0.5) +
  labs(title = "Log Employment of Doctors by Unionization Status (Jan. 2003 to Aug. 2024)",
       x = "",
       y = "Log Count") +
  
  scale_color_manual(values = c("#9b494f","#31597d"), 
                       c(name="")) +
  
  geom_vline(xintercept = 2020-01-01, linetype = "dashed", color = "grey", size = 0.5) +
  theme(legend.position="bottom") +
  theme_stata()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# creating table to show missing workers, total workers, predicted total workers, and percent missing

d1.table <- d.cov %>% 
  filter(between(date, as.Date("2024-04-01"), as.Date("2024-09-01"))) %>% 
  group_by(UNION) %>% 
  summarise(
    "Total Workers" = sum(count, na.rm = TRUE),
    "Predicted Total Workers" = sum(exp_values, na.rm = TRUE),
    "Missing Workers" = `Predicted Total Workers` - `Total Workers`,
    "Percent Missing" = (`Missing Workers` / `Predicted Total Workers`) * 100
  ) %>% 
  rename("Union Status" = UNION) %>%
  select("Union Status", "Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing") 


d2.table <- d.nocov %>% 
  filter(between(date, as.Date("2024-04-01"), as.Date("2024-09-01"))) %>% 
  group_by(UNION) %>% 
  summarise(
    "Total Workers" = sum(count, na.rm = TRUE),
    "Predicted Total Workers" = sum(exp_values, na.rm = TRUE),
    "Missing Workers" = `Predicted Total Workers` - `Total Workers`,
    "Percent Missing" = (`Missing Workers` / `Predicted Total Workers`) * 100
  ) %>% 
  rename("Union Status" = UNION) %>%
  select("Union Status", "Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing") 


d.table <- bind_rows(d1.table, d2.table) %>% 
  mutate(
    `Missing Workers` = format(`Missing Workers`, big.mark = ",", scientific = FALSE),
    `Total Workers` = format(`Total Workers`, big.mark = ",", scientific = FALSE),
    `Predicted Total Workers` = format(`Predicted Total Workers`, big.mark = ",", scientific = FALSE),
    `Percent Missing` = format(`Percent Missing`, nsmall = 2, big.mark = ",", scientific = FALSE)) %>% 
  kable(
    caption = "Table 1: Outgoing Rotation Groups Only",
    col.names = c("Industry", "Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing"),
    format = "html", 
    digits = c(0, 0, 0, 0, 2)  # Specify number of decimal places for each column
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  ) %>%
  add_header_above(c(" " = 1, "(1)" = 1, "(2)" = 1, "(3)" = 1, "(4)" = 1)) %>%
  footnote(
    general = "The table above reports the number of missing workers by unionization status The linear time trend was taken from January 2013 to January 2020 to calculate the number of missing workers from the last 6 months.")
# creating regressions for nurses by unionization status

# nurses with union coverage

n.cov <- nurses %>% 
  filter(UNION == "Union coverage")

nreg.cov <- lm(log_count ~ date, 
       data = n.cov %>% 
         filter(between(date, as.Date('2013-01-01'), as.Date('2020-01-01'))))

nreg2.cov <- lm(count ~ date, 
       data = n.cov %>% 
         filter(between(date, as.Date('2013-01-01'), as.Date('2020-01-01'))))

n.cov$log_exp_values <- nreg.cov$coefficients["(Intercept)"] + 
    nreg.cov$coefficients["date"]*as.numeric(n.cov$date)

n.cov$exp_values <- nreg2.cov$coefficients["(Intercept)"] + 
    nreg2.cov$coefficients["date"]*as.numeric(n.cov$date)

# nurses with no union coverage

n.nocov <- nurses %>% 
  filter(UNION == "No union coverage")

nreg.nocov <- lm(log_count ~ date, 
       data = n.nocov %>% 
         filter(between(date, as.Date('2013-01-01'), as.Date('2020-01-01'))))

nreg2.nocov <- lm(count ~ date, 
       data = n.nocov %>% 
         filter(between(date, as.Date('2013-01-01'), as.Date('2020-01-01'))))

n.nocov$log_exp_values <- nreg.nocov$coefficients["(Intercept)"] + 
    nreg.nocov$coefficients["date"]*as.numeric(n.nocov$date)
  
n.nocov$exp_values <- nreg2.nocov$coefficients["(Intercept)"] + 
    nreg2.nocov$coefficients["date"]*as.numeric(n.nocov$date)
  
  

# plotting 

n.plot <- ggplot(nurses, aes(x = date, y = log_count, color = UNION)) +
   

   geom_line(aes(y=log_exp_values, x=date), color="darkblue", linetype="dotted", linewidth=.8,
              data=n.cov %>% filter(date>=ymd("2013-01-01"))) + 
   
   
   geom_line(aes(y=log_exp_values, x=date), color="red", linetype="dotted", linewidth=.8,
              data=n.nocov %>% filter(date>=ymd("2013-01-01"))) + 
  
  geom_line(size = 0.5) +
   
  labs(title = "Log Employment of Nurses by Unionization Status (Jan. 2003 to Aug. 2024)",
       x = "",
       y = "Log Count") +
  scale_color_manual(values = c("#9b494f","#31597d"), 
                       c(name="")) +
  geom_vline(xintercept = as.Date("2020-01-01"), linetype = "dashed", color = "grey", size = 0.5) +
  theme(legend.position="bottom") +
  theme_stata()
# creating table to show missing workers, total workers, predicted total workers, and percent missing

n1.table <- n.cov %>% 
  filter(between(date, as.Date("2024-04-01"), as.Date("2024-09-01"))) %>% 
  group_by(UNION) %>% 
  summarise(
    "Total Workers" = sum(count, na.rm = TRUE),
    "Predicted Total Workers" = sum(exp_values, na.rm = TRUE),
    "Missing Workers" = `Predicted Total Workers` - `Total Workers`,
    "Percent Missing" = (`Missing Workers` / `Predicted Total Workers`) * 100
  ) %>% 
  rename("Union Status" = UNION) %>%
  select("Union Status", "Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing") %>% 
  mutate(`Percent Missing` = round(`Percent Missing`, 2))


n2.table <- n.nocov %>% 
  filter(between(date, as.Date("2024-04-01"), as.Date("2024-09-01"))) %>% 
  group_by(UNION) %>% 
  summarise(
    "Total Workers" = sum(count, na.rm = TRUE),
    "Predicted Total Workers" = sum(exp_values, na.rm = TRUE),
    "Missing Workers" = `Predicted Total Workers` - `Total Workers`,
    "Percent Missing" = (`Missing Workers` / `Predicted Total Workers`) * 100
  ) %>% 
  rename("Union Status" = UNION) %>%
  select("Union Status", "Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing") %>% 
  mutate(`Percent Missing` = round(`Percent Missing`, 2))

n.table <- bind_rows(n1.table, n2.table) %>% 
  mutate(
    `Missing Workers` = format(`Missing Workers`, big.mark = ",", scientific = FALSE),
    `Total Workers` = format(`Total Workers`, big.mark = ",", scientific = FALSE),
    `Predicted Total Workers` = format(`Predicted Total Workers`, big.mark = ",", scientific = FALSE),
    `Percent Missing` = format(`Percent Missing`, nsmall = 2, big.mark = ",", scientific = FALSE)) %>% 
  kable(
    caption = "Table 2: Outgoing Rotation Groups Only",
    col.names = c("Industry", "Missing Workers", "Total Workers", "Predicted Total Workers", "Percent Missing"),
    format = "html", 
    digits = c(0, 0, 0, 0, 2)  # Specify number of decimal places for each column
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  ) %>%
  add_header_above(c(" " = 1, "(1)" = 1, "(2)" = 1, "(3)" = 1, "(4)" = 1)) %>%
  footnote(
    general = "The table above reports the number of missing workers by unionization status. The linear time trend was taken from January 2013 to January 2020 to calculate the number of missing workers from the last 6 months.")

3 Figures and Tables

Table 1: Outgoing Rotation Groups Only
(1)
(2)
(3)
(4)
Industry Missing Workers Total Workers Predicted Total Workers Percent Missing
Union coverage 517,201.6 372,914.4 890,116 58.10497
No union coverage 931,422.0 4,819,105.9 5,750,528 16.19716
Note:
The table above reports the number of missing workers by unionization status The linear time trend was taken from January 2013 to January 2020 to calculate the number of missing workers from the last 6 months.
Table 2: Outgoing Rotation Groups Only
(1)
(2)
(3)
(4)
Industry Missing Workers Total Workers Predicted Total Workers Percent Missing
Union coverage -844,523.7 6,063,185 5,218,661 -16.18
No union coverage 1,108,398.5 32,842,558 33,950,957 3.26
Note:
The table above reports the number of missing workers by unionization status. The linear time trend was taken from January 2013 to January 2020 to calculate the number of missing workers from the last 6 months.