22 Effect of wages on remaining employed

Aims

The aims of this notebook are to understand the effect of wages on remaining employed.

The variables to look at are:

  • employees_w12_paygl: gross pay at last payment
  • employees_w12_paynl: take home pay at last payment
  • seearngrs_dv: self-employment earnings - gross
  • seearnnet_dv: self-employment earnings - net

We are predominantly interested in the extent to which different levels of the above affect probability of those employed remaining employed at the next wave.

Preparation

Code
library(tidyverse)
library(nnet)

devtools::load_all(here::here('R'))

varnames <-  c(
  "jbstat", "dvage", "sex", 
  "payg_dv", "payn_dv" # Gross and net monthly take-home pay
  )

vartypes <- c(
  "labels", "values", "labels",  
  "values", "values"
  )

df_ind <- get_ind_level_vars_for_selected_waves(varnames = varnames, vartypes = vartypes, waves = letters[1:11])

# Clean the data 
df_ind_wages_standardised <- 
  df_ind |> 
  # dvage uses negative values to indicate missing. The code below explicitly turns them all to missing values
    mutate(across(c(dvage, payn_dv, payg_dv), function(x) ifelse(x < 0, NA, x))) |>
  # This renames dvage to age
    rename(age = dvage) |> 
    mutate(
      ifelse(sex %in% c("male", "female"), sex, NA)
    ) |> 
    filter(between(age, 16, 64))  %>% 
    filter(complete.cases(.)) |> 
    group_by(wave) 

We are not yet sure whether the derived wages variables includes second jobs.

Let’s now look at the range of take-home pay by wave for those employed

Code
df_ind_wages_standardised |> 
  group_by(wave) |> 
  summarise(
    lower_q = quantile(payg_dv, 0.25, na.rm = TRUE),
    median_q = median(payg_dv, na.rm = TRUE),
    upper_q = quantile(payg_dv, 0.75, na.rm = TRUE)
  )
# A tibble: 11 × 4
   wave  lower_q median_q upper_q
   <chr>   <dbl>    <dbl>   <dbl>
 1 a        882.    1508    2481.
 2 b        917.    1583    2500 
 3 c        932     1600    2500 
 4 d        953.    1625    2583.
 5 e        990     1667.   2607 
 6 f       1000     1700    2667.
 7 g       1018.    1733.   2708.
 8 h       1080     1772.   2800 
 9 i       1090.    1800    2833.
10 j       1167.    1900    2968 
11 k       1233.    2000    3035.

Because the median and distribution of wages changed with each wave, in order to make use of all waves worth of data as predictors of remaining employment, we should probably look at the effect of wage-specific quantiles on remaining employed

Code
df_ind_wages_normalised <- 
  df_ind_wages_standardised |> 
    group_by(wave) |> 
    mutate(
      z_payg_dv = ecdf(payg_dv)(payg_dv),
      z_payn_dv = ecdf(payn_dv)(payn_dv)
    ) |> 
    mutate(not_employed = next_status != "Employed")

Now let’s start with the simplest model, just predicting whether next_wave is still employed

Code
mod_00 <- glm(not_employed ~ z_payg_dv, family = binomial(link = 'logit'), data = df_ind_wages_normalised)
Code
summary(mod_00)

Call:
glm(formula = not_employed ~ z_payg_dv, family = binomial(link = "logit"), 
    data = df_ind_wages_normalised)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7429  -0.4677  -0.3053  -0.2005   2.9981  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.14626    0.01511  -75.87   <2e-16 ***
z_payg_dv   -3.33670    0.03963  -84.19   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 90359  on 163615  degrees of freedom
Residual deviance: 81514  on 163614  degrees of freedom
AIC: 81518

Number of Fisher Scoring iterations: 6

The more relative pay people in employment get, the lower the probability of moving from employment becomes, and this term is statistically significant.

What does this mean substantively?

Code
predict(mod_00, type = "response", newdata = data.frame(z_payg_dv = c(0.2, 0.4, 0.5, 0.6, 0.8)))
         1          2          3          4          5 
0.14020359 0.07720525 0.05653982 0.04115915 0.02154949 

For someone earning less than 80% of earners, the probability of no longer being employed in the next wave is 14%. For someone earning more than 80% of earners, the probability of no longer being employed in the next wave is 2%. For someone earning the median amount, the probability of not being employed in the next wave is 5.6%.

Now let’s add the standard controls

Code
mod_01 <- glm(not_employed ~ sex + splines::bs(age, 5) + z_payg_dv, family = binomial(link = 'logit'), data = df_ind_wages_normalised)
Code
summary(mod_01)

Call:
glm(formula = not_employed ~ sex + splines::bs(age, 5) + z_payg_dv, 
    family = binomial(link = "logit"), data = df_ind_wages_normalised)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5968  -0.3699  -0.2712  -0.1980   3.0013  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)           0.92840    0.04069  22.815   <2e-16 ***
sexmale               0.01957    0.02161   0.906    0.365    
splines::bs(age, 5)1 -3.06056    0.09541 -32.078   <2e-16 ***
splines::bs(age, 5)2 -3.19548    0.07231 -44.194   <2e-16 ***
splines::bs(age, 5)3 -3.72907    0.09070 -41.115   <2e-16 ***
splines::bs(age, 5)4 -2.47387    0.07098 -34.854   <2e-16 ***
splines::bs(age, 5)5 -1.07421    0.06181 -17.378   <2e-16 ***
z_payg_dv            -1.98243    0.04503 -44.027   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 90359  on 163615  degrees of freedom
Residual deviance: 73228  on 163608  degrees of freedom
AIC: 73244

Number of Fisher Scoring iterations: 6

Once age is controlled for in the standard way, the effect of sex becomes non-significant.

Let’s compare the models

Code
AIC(mod_00, mod_01)
       df      AIC
mod_00  2 81518.31
mod_01  8 73244.14
Code
BIC(mod_00, mod_01)
       df      BIC
mod_00  2 81538.32
mod_01  8 73324.18

The more complex model is still preferred

Let’s work out what mod_01 is implying

Code
pred_df <- expand_grid(
  z_payg_dv = c(0.2, 0.4, 0.5, 0.6, 0.8),
  sex = c("male", "female"),
  age = 20:60
)

pred_df <- pred_df |> 
  mutate(
    pred_move = predict(mod_01, type = "response", newdata = pred_df)
  )

Let’s see what this shows:

Code
pred_df |> 
  ggplot(
    aes(
      x = age, y = pred_move, group = factor(z_payg_dv), color = factor(z_payg_dv)
    )) + geom_line() + facet_wrap(~sex)

The model predicts a strongly U-shaped relationship with age. There is no interaction between age and quantile. We could look at that next

Code
mod_02 <- glm(not_employed ~ sex + splines::bs(age, 5) * z_payg_dv, family = binomial(link = 'logit'), data = df_ind_wages_normalised)

AIC(mod_00, mod_01, mod_02)
       df      AIC
mod_00  2 81518.31
mod_01  8 73244.14
mod_02 13 72006.08
Code
BIC(mod_00, mod_01, mod_02)
       df      BIC
mod_00  2 81538.32
mod_01  8 73324.18
mod_02 13 72136.15

The model with interactions between age and relative wage is preferred over the next most complex model. As before, let’s look at what the model predicts

Code
pred_df <- expand_grid(
  z_payg_dv = c(0.2, 0.4, 0.5, 0.6, 0.8),
  sex = c("male", "female"),
  age = 20:60
)

pred_df <- pred_df |> 
  mutate(
    pred_move = predict(mod_02, type = "response", newdata = pred_df)
  )
Code
pred_df |> 
  ggplot(
    aes(
      x = age, y = pred_move, group = factor(z_payg_dv), color = factor(z_payg_dv)
    )) + geom_line() + facet_wrap(~sex)

Comparatively wages have a very strong influence on probability of remaining employed at younger adult ages. These influences diminish with age. From the early to mid 50s the probability of leaving employment increases regardless of comparative wages.

There are a couple more modifications to consider:

  • Include an interaction between sex and comparative wages (for example, are women more likely to exit employment if they receive low wages than men, or vice versa?)

  • Move to multinomial logistic regression, to model where people move if they move from employment

Let’s start with the sex interaction term

Code
mod_03 <- glm(not_employed ~ sex + sex:z_payg_dv + splines::bs(age, 5) * z_payg_dv, family = binomial(link = 'logit'), data = df_ind_wages_normalised)

AIC(mod_00, mod_01, mod_02, mod_03)
       df      AIC
mod_00  2 81518.31
mod_01  8 73244.14
mod_02 13 72006.08
mod_03 14 72008.03
Code
BIC(mod_00, mod_01, mod_02, mod_03)
       df      BIC
mod_00  2 81538.32
mod_01  8 73324.18
mod_02 13 72136.15
mod_03 14 72148.10

The model with a sex:wage interaction is not preferred to the model without such terms.

Now let’s run the multinomial logit model:

Code
mod_multi_00 <- nnet::multinom(
  next_status ~ sex  + splines::bs(age, 5) * z_payg_dv, 
  data = df_ind_wages_normalised
  )
# weights:  98 (78 variable)
initial  value 318382.034948 
iter  10 value 60337.114215
iter  20 value 53099.169011
iter  30 value 49280.455396
iter  40 value 48155.231702
iter  50 value 47801.646126
iter  60 value 47358.005159
iter  70 value 47306.211706
iter  80 value 47260.578908
iter  90 value 47073.166846
iter 100 value 47035.764377
final  value 47035.764377 
stopped after 100 iterations

Let’s now see what it predicts:

Code
pred_df <- expand_grid(
  z_payg_dv = c(0.2, 0.4, 0.5, 0.6, 0.8),
  sex = c("male", "female"),
  age = 20:60
)

pred_preds_df <- bind_cols(
  pred_df,
  predict(mod_multi_00, type = "probs", newdata = pred_df)
) |> 
  pivot_longer(
    Employed:Unemployed, names_to = "state", values_to = "probability"
  )

Now to visualise it, focusing on Unemployment, Inactive care, Inactive long term sick, Inactive student, and Inactive retired

Code
pred_preds_df |> 
  filter(
    state %in% c("Employed", "Inactive care", "Inactive student", "Inactive long term sick", "Unemployed", "Inactive retired")
  ) |> 
  ggplot(aes(x = age, y = probability, colour = factor(z_payg_dv), group = z_payg_dv)) +
  geom_line() + 
  facet_grid(state ~ sex, scales = "free_y")

Let’s look at the equivalent for net

Code
mod_multi_00_net <- nnet::multinom(
  next_status ~ sex  + splines::bs(age, 5) * z_payn_dv, 
  data = df_ind_wages_normalised
  )
# weights:  98 (78 variable)
initial  value 318382.034948 
iter  10 value 60451.180405
iter  20 value 53017.520478
iter  30 value 49595.894512
iter  40 value 48231.978397
iter  50 value 47910.906606
iter  60 value 47429.744570
iter  70 value 47397.698092
iter  80 value 47350.414975
iter  90 value 47174.938954
iter 100 value 47120.210532
final  value 47120.210532 
stopped after 100 iterations
Code
pred_df <- expand_grid(
  z_payn_dv = c(0.2, 0.4, 0.5, 0.6, 0.8),
  sex = c("male", "female"),
  age = 20:60
)

pred_preds_df_net <- bind_cols(
  pred_df,
  predict(mod_multi_00_net, type = "probs", newdata = pred_df)
) |> 
  pivot_longer(
    Employed:Unemployed, names_to = "state", values_to = "probability"
  )
Code
pred_preds_df_net |> 
  filter(
    state %in% c("Employed", "Inactive care", "Inactive student", "Inactive long term sick", "Unemployed", "Inactive retired")
  ) |> 
  ggplot(aes(x = age, y = probability, colour = factor(z_payn_dv), group = z_payn_dv)) +
  geom_line() + 
  facet_grid(state ~ sex, scales = "free_y")

I don’t think there’s any substantive difference whether using gross or net pay. For consistency I’m going to keep using gross pay for the counterfactual

Counterfactual Simulation

Let’s model the following counterfactual:

  • If someone is earning at at least the 40th percentile, their pay stays as is;

  • If someone is earning below the 40th percentile, their pay is moved up to the 40th percentile

Code
df_ind_baseline <- 
  df_ind_wages_normalised |>
    ungroup() |> 
    filter(wave == 'j')

df_ind_counterfactual <-
  df_ind_baseline |> 
  mutate(
    z_payg_dv = ifelse(z_payg_dv < 0.4, 0.4, z_payg_dv)
  )

preds_baseline <- predict(mod_multi_00, newdata = df_ind_baseline, type = "probs")

preds_counterfactual <- predict(mod_multi_00, newdata = df_ind_counterfactual, type = "probs")


predictions_summary_matrix <- cbind(
  # The number 2 indicates do the sum function for each column.
  # If it were 1 then this would sum for each row (which should add up to 1 in call cases)
  apply(preds_baseline, 2, sum),
  apply(preds_counterfactual, 2, sum)
)

colnames(predictions_summary_matrix) <- c("base", "counterfactual")
predictions_summary_matrix
                               base counterfactual
Employed                10882.48087    11230.64693
Inactive care              89.46711       52.04180
Inactive long term sick    49.18924       38.07907
Inactive other             26.78770       22.78726
Inactive retired          229.76486      227.66949
Inactive student          352.69113       95.13879
Unemployed                234.61909      198.63665
Code
sim_relative_change <- apply(
    predictions_summary_matrix, 1, function(x) (100 * x / x[1])
  ) |> 
  t()

sim_relative_change
                        base counterfactual
Employed                 100      103.19933
Inactive care            100       58.16864
Inactive long term sick  100       77.41343
Inactive other           100       85.06615
Inactive retired         100       99.08804
Inactive student         100       26.97510
Unemployed               100       84.66346

In this scenario, the proportion of those employed who remain employed is expected to increase by around 3%, the proportion transitioning from employment to full-time care to fall by around 40%, from employment to long-term sick to fall by around 22%, and to unemployment by around 15%. Many of these transitions from employment are quite unusual; instead such transitions may occur over multiple waves, often via unemployment through an intermediate state. The following table shows the proportion of the population projected to move into each state in both the baseline and counterfactual scenario:

Code
prop_in_scenario <- apply(
    predictions_summary_matrix, 2, function(x) (100 * x / sum(x))
  ) 


prop_in_scenario
                              base counterfactual
Employed                91.7191814     94.6535772
Inactive care            0.7540422      0.4386161
Inactive long term sick  0.4145743      0.3209361
Inactive other           0.2257707      0.1920545
Inactive retired         1.9364927      1.9188326
Inactive student         2.9725337      0.8018440
Unemployed               1.9774049      1.6741395

So, in the baseline scenario around 92% of those employed in wave T are predicted to remain employed at wave T+1, around 3% to become a student at wave T+1, around 2% to become retired, or to become unemployed. In the counterfactual scenario the proportion remaining employed is projected to increase retention of employment by 2.9%.

Notes from Martin

The function ecdf converts the real wages into a value between 0 and 1, where e.g. 0.2 is 20% up the wages distribution, with 80% of waged employees earning more than them.

This allows us to compare relative position across all the waves without adjusting wages 

Model 1 - probability of moving from employment to other categories, based on age sex and wage position

We included an interaction between wages and age

But we don’t think at the broadest level sex makes a difference

This is all standard logistic regression

We don’t think we can use the Minimum Income Standard because this operates at a household income 

Could a wage based intervention increase inequalities?

By having people who moved into education remaining in better paid work?

Should we give a different floor? By age  group?