Modelling Drivers of Economic (In)Activity

Jon Minton

Who am I?

  • If you like this presentation: Public Health Intelligence Adviser at Public Health Scotland
  • If you don’t like this presentation: Just someone who likes coding in R

My challenge… now your challenge

Produce estimates of the influence of drivers on economic inactivity

  • drivers: Factors that influence probabilities of being economically active or inactive in some way
    • Can be divided into:
      • Downstream/Upstream
      • Proximate/Distal
      • Individual/Household/Local Area/Structural
  • economic inactivity:
    • Persons not contributing currently providing labour or labour supply:
      • Economic activity:
        • Employed (realised labour)
        • Unemployed (labour supply)
      • Economic inactivity:
        • Full-time student
        • Full-time carer
        • Long-term sick
        • Retired
        • Other (catch-all)

Useful ideas/concepts

  • Epidemiological concepts
  • Modelling concepts
  • Path dependency
  • Demography
  • ‘Explained by’
    • Population Attributable Fractions
    • Ideal Alternative Exposure Scenarios
  • Baseline and Counterfactual scenarios
  • Markov modelling
  • Generalised Linear modelling
  • Model fitting
  • Using models for prediction/projection

GLM fundamentals

Stochastic Component

\[ Y_i \sim f(\theta_i, \alpha) \] Systematic Component

\[ \theta_i = g(X_i, \beta) \]

Source: King, Tomz, Wittenberg (2000), Making the most of Statistical Analyses: Improving Interpretation and presentation

Stochastic

\[ Y_i \sim Bernoulli(\pi_i) \] Systematic

\[ \pi_i = \dfrac{1}{1 + e^{-X\beta}} \]

Stochastic

\[ Y_i \sim N(\mu_i, \sigma^2) \]

Systematic

\[ \mu_i = X_i \beta \]

Model Fitting and Model Prediction

  • Satisfy loss function: Find values of \(\beta\) such that discrepency between \(y_i\) (observed response) and \(Y_i\) (predicted response given \(X_i\), observed predictors) is minimised.

  • Statisticians usually use likelihood theory to justify the loss function. ‘Data scientists’ can be a bit wilder. (e.g. RMSE on out-of-sample test data)

  • Once the best model parameters \(\beta\) have been identified for our model \(M\), we can now swap out observed predictors \(X_i\) for hypothetical predictors \(X^{(H)}\) to get conditional predictions \(Y | X^{(H)}\)
  • Let’s split the predictors \(X\) into \((X^*, Z)\), where \(X^*\) is variables we aren’t interested in but need to control for, and \(Z\) are the exposure variables for which we are interested in modelling the causal effects on \(Y\).

  • \(H_1 := (X^*, Z=1)\) : The exposure is ‘on’

  • \(H_0 := (X^*, Z=0)\) : The exposure is ‘off’

  • Comparison of \(Y|H_1\) and \(Y|H_0\) gives an effect estimate of \(Z\) on \(Y\), i.e. the amount that \(Y\) changes as a result of \(Z\), holding all else constant.

  • In practice:

    • Many exposures are not either ‘on’ or ‘off’, but continuous
    • Rather than \(H_1\) being \(Z=1\) for all observations, we might use the observed values of \(Z\) from an indicative dataset. (Not everyone smokes/ has diabetes etc)
  • We can use the same GLM ‘chassis’ to model the influence of history.

  • Swap out \(Y\) for \(Y_{T+1}\) on the response side

  • Include \(Y_T\) on the predictor side

  • If Y is continuous we have conventional time-series modelling

  • If Y is discrete we have Markov modelling (broadly speaking)

In practice

We require data where:

  • we can observe the same individuals transitioning between discrete states over time
  • there are adequate demographic controls
  • there are driver/exposure variables of interest

I have used UK Household Longitudinal Study (UKHLS), which grew out of the British Household Panel Survey (BHPS).

A model that adequately controls for path dependency \(Y_T\) and demography \(X^*\)

  • Different specifications can do this. I used AIC & BIC to decide between different possible specifications

A model that incrementally builds on the Foundational model to include at least one exposure variable as well.

  • Decision guided both by substantive interest in a given exposure, and whether AIC and BIC suggest it is ‘better’ than the Foundational model
  • Absent clear epidemiological knowledge otherwise, continuous exposures are standardised then, in the counterfactual scenario, moved one standardised unit in the ‘good’ direction.

Illustration using logit model

# A tibble: 99,989 × 6
       pidp wave    age sex    this_status next_status
      <dbl> <chr> <dbl> <chr>  <lgl>       <lgl>      
 1 68004087 a        59 male   TRUE        TRUE       
 2 68006127 a        39 female TRUE        TRUE       
 3 68007487 a        57 female TRUE        TRUE       
 4 68008167 a        38 female FALSE       FALSE      
 5 68008171 a        51 male   TRUE        TRUE       
 6 68008847 a        51 female TRUE        TRUE       
 7 68009527 a        31 male   TRUE        TRUE       
 8 68010887 a        45 female TRUE        TRUE       
 9 68011567 a        35 male   TRUE        TRUE       
10 68014287 a        39 female FALSE       FALSE      
# ℹ 99,979 more rows
mod_00 <- glm(
  next_status ~ this_status * sex + splines::bs(age, 5),
  family = binomial(link = 'logit'),
  data = df_ind
  )

summary(mod_00)

Call:
glm(formula = next_status ~ this_status * sex + splines::bs(age, 
    5), family = binomial(link = "logit"), data = df_ind)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6957  -0.2654   0.2493   0.3496   2.7009  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -1.90210    0.04773 -39.852  < 2e-16 ***
this_statusTRUE          3.92767    0.02896 135.631  < 2e-16 ***
sexmale                  0.38281    0.03526  10.857  < 2e-16 ***
splines::bs(age, 5)1     1.11093    0.10478  10.602  < 2e-16 ***
splines::bs(age, 5)2     0.39845    0.07766   5.131 2.88e-07 ***
splines::bs(age, 5)3     1.26160    0.09308  13.554  < 2e-16 ***
splines::bs(age, 5)4    -0.10536    0.07812  -1.349    0.177    
splines::bs(age, 5)5    -1.71904    0.07079 -24.285  < 2e-16 ***
this_statusTRUE:sexmale  0.33342    0.04812   6.929 4.23e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 115328  on 99988  degrees of freedom
Residual deviance:  55307  on 99980  degrees of freedom
AIC: 55325

Number of Fisher Scoring iterations: 6
# A tibble: 12 × 5
     age sex    this_status prob_active prob_inactive
   <dbl> <chr>  <lgl>             <dbl>         <dbl>
 1    25 male   TRUE             0.971         0.0289
 2    25 male   FALSE            0.322         0.678 
 3    25 female TRUE             0.943         0.0574
 4    25 female FALSE            0.244         0.756 
 5    40 male   TRUE             0.972         0.0279
 6    40 male   FALSE            0.329         0.671 
 7    40 female TRUE             0.944         0.0556
 8    40 female FALSE            0.251         0.749 
 9    60 male   TRUE             0.888         0.112 
10    60 male   FALSE            0.101         0.899 
11    60 female TRUE             0.795         0.205 
12    60 female FALSE            0.0710        0.929 

In practice: Multinomial Logit

  • multinomial logit extends from two mutually exclusive states to K mutually exclusive states
  • implemented using mlogit function in nnet package
  • more computationally intensive but not different conceptually
# A tibble: 99,989 × 6
       pidp wave    age sex    this_status   next_status            
      <dbl> <chr> <dbl> <chr>  <chr>         <chr>                  
 1 68004087 a        59 male   Employed      Employed               
 2 68006127 a        39 female Unemployed    Unemployed             
 3 68007487 a        57 female Unemployed    Unemployed             
 4 68008167 a        38 female Inactive care Inactive long term sick
 5 68008171 a        51 male   Employed      Employed               
 6 68008847 a        51 female Employed      Employed               
 7 68009527 a        31 male   Employed      Unemployed             
 8 68010887 a        45 female Employed      Employed               
 9 68011567 a        35 male   Employed      Employed               
10 68014287 a        39 female Inactive care Inactive care          
# ℹ 99,979 more rows
mod_00 <- 
  nnet::multinom(
    next_status ~ this_status * sex + splines::bs(age, 5),
    data = df_ind
  )
## # weights:  140 (114 variable)
## initial  value 194569.609894 
## iter  10 value 54214.629019
## iter  20 value 51957.007618
## iter  30 value 50975.093843
## iter  40 value 49988.305916
## iter  50 value 49844.469018
## iter  60 value 49757.619656
## iter  70 value 49717.329389
## iter  80 value 49701.151713
## iter  90 value 49689.112117
## iter 100 value 49682.003832
## final  value 49682.003832 
## stopped after 100 iterations

summary(mod_00)
## Call:
## nnet::multinom(formula = next_status ~ this_status * sex + splines::bs(age, 
##     5), data = df_ind)
## 
## Coefficients:
##                         (Intercept) this_statusInactive care
## Inactive care             -4.264675                 5.580379
## Inactive long term sick   -6.353690                 4.081220
## Inactive other            -5.150505                 3.780599
## Inactive retired          -8.785411                 3.571897
## Inactive student          -1.600841                 1.845938
## Unemployed                -2.908889                 3.393434
##                         this_statusInactive long term sick
## Inactive care                                     4.540475
## Inactive long term sick                           8.322309
## Inactive other                                    4.208462
## Inactive retired                                  4.638818
## Inactive student                                  2.661586
## Unemployed                                        4.568436
##                         this_statusInactive other this_statusInactive retired
## Inactive care                            3.647580                    4.035039
## Inactive long term sick                  3.578256                    4.407072
## Inactive other                           5.826705                    3.439706
## Inactive retired                         2.944156                    5.499531
## Inactive student                         2.771293                    2.972128
## Unemployed                               2.873311                    2.178185
##                         this_statusInactive student this_statusUnemployed
## Inactive care                              1.299945              3.509757
## Inactive long term sick                    2.168923              4.275870
## Inactive other                             2.713042              2.963995
## Inactive retired                           1.265626              2.411328
## Inactive student                           3.715771              1.905086
## Unemployed                                 2.147735              3.973331
##                            sexmale splines::bs(age, 5)1 splines::bs(age, 5)2
## Inactive care           -2.8303907            0.8112421            0.9099609
## Inactive long term sick -0.0556538            0.2559587            0.4892264
## Inactive other           0.2073837           -1.5827796           -1.1475237
## Inactive retired        -0.2872212           -0.5234163           -2.5719178
## Inactive student        -0.3962258           -1.8118528           -3.1288966
## Unemployed               0.3195387           -0.6178169           -1.0568732
##                         splines::bs(age, 5)3 splines::bs(age, 5)4
## Inactive care                     -0.1647709            0.8361373
## Inactive long term sick            0.9226743            1.9611845
## Inactive other                    -1.0906306           -0.4001185
## Inactive retired                   3.5484489            6.2454401
## Inactive student                  -3.9914465           -5.0954977
## Unemployed                        -1.1154563           -0.5016708
##                         splines::bs(age, 5)5 this_statusInactive care:sexmale
## Inactive care                      0.3569390                        2.5082715
## Inactive long term sick            1.1868514                        0.2302962
## Inactive other                    -0.5552339                        1.2885666
## Inactive retired                   7.9584280                       -0.3725108
## Inactive student                  -6.4353343                       -0.3348881
## Unemployed                        -1.4102953                        0.4964148
##                         this_statusInactive long term sick:sexmale
## Inactive care                                           0.42725912
## Inactive long term sick                                -0.08758388
## Inactive other                                         -0.59378873
## Inactive retired                                       -0.35783551
## Inactive student                                        0.16734681
## Unemployed                                             -0.21685831
##                         this_statusInactive other:sexmale
## Inactive care                                  1.62474870
## Inactive long term sick                        0.26089140
## Inactive other                                -0.82724020
## Inactive retired                              -0.22296908
## Inactive student                               0.15768088
## Unemployed                                    -0.03014452
##                         this_statusInactive retired:sexmale
## Inactive care                                    0.50020854
## Inactive long term sick                          0.52587521
## Inactive other                                   0.09880642
## Inactive retired                                -0.05111257
## Inactive student                                -1.66491062
## Unemployed                                       0.13247450
##                         this_statusInactive student:sexmale
## Inactive care                                    0.56153093
## Inactive long term sick                          0.31393653
## Inactive other                                   0.03498304
## Inactive retired                                -0.15240489
## Inactive student                                 0.39807143
## Unemployed                                       0.07827380
##                         this_statusUnemployed:sexmale
## Inactive care                              0.34742950
## Inactive long term sick                   -0.33434819
## Inactive other                            -0.65373515
## Inactive retired                          -0.25361540
## Inactive student                           0.04357869
## Unemployed                                -0.15493257
## 
## Std. Errors:
##                         (Intercept) this_statusInactive care
## Inactive care            0.16248654               0.05309671
## Inactive long term sick  0.26073990               0.11645678
## Inactive other           0.23701306               0.17052665
## Inactive retired         1.71307219               0.09996639
## Inactive student         0.08087998               0.14558880
## Unemployed               0.08849234               0.06684793
##                         this_statusInactive long term sick
## Inactive care                                    0.1366876
## Inactive long term sick                          0.1396415
## Inactive other                                   0.3193084
## Inactive retired                                 0.1541554
## Inactive student                                 0.3425172
## Unemployed                                       0.1363659
##                         this_statusInactive other this_statusInactive retired
## Inactive care                           0.1592756                   0.1466636
## Inactive long term sick                 0.2969801                   0.1883779
## Inactive other                          0.1990749                   0.3651422
## Inactive retired                        0.2661224                   0.1136774
## Inactive student                        0.2491562                   0.6784096
## Unemployed                              0.1965656                   0.2593636
##                         this_statusInactive student this_statusUnemployed
## Inactive care                            0.13446168            0.06609202
## Inactive long term sick                  0.25346504            0.11472025
## Inactive other                           0.22623500            0.21334522
## Inactive retired                         0.71542075            0.15317899
## Inactive student                         0.07271687            0.11849251
## Unemployed                               0.08492230            0.06098611
##                            sexmale splines::bs(age, 5)1 splines::bs(age, 5)2
## Inactive care           0.15989693            0.2591058            0.1585449
## Inactive long term sick 0.12231001            0.4075431            0.2506371
## Inactive other          0.16255927            0.4041402            0.3241556
## Inactive retired        0.06643554            3.0747685            1.7525551
## Inactive student        0.08630838            0.1423782            0.1863921
## Unemployed              0.05046323            0.1494099            0.1097848
##                         splines::bs(age, 5)3 splines::bs(age, 5)4
## Inactive care                      0.2210071            0.1878868
## Inactive long term sick            0.3147839            0.2646640
## Inactive other                     0.3677834            0.3320895
## Inactive retired                   1.7793310            1.7047936
## Inactive student                   0.2968194            0.5222790
## Unemployed                         0.1403807            0.1309305
##                         splines::bs(age, 5)5 this_statusInactive care:sexmale
## Inactive care                      0.2237352                        0.2332628
## Inactive long term sick            0.2865589                        0.3279407
## Inactive other                     0.3618112                        0.3354850
## Inactive retired                   1.7159605                        0.3105845
## Inactive student                   1.0496550                        1.0183519
## Unemployed                         0.1622973                        0.2073871
##                         this_statusInactive long term sick:sexmale
## Inactive care                                            0.3333645
## Inactive long term sick                                  0.1970728
## Inactive other                                           0.4744219
## Inactive retired                                         0.2076420
## Inactive student                                         0.5084220
## Unemployed                                               0.1878839
##                         this_statusInactive other:sexmale
## Inactive care                                   0.3286156
## Inactive long term sick                         0.4137753
## Inactive other                                  0.3012812
## Inactive retired                                0.4079154
## Inactive student                                0.3588901
## Unemployed                                      0.2683190
##                         this_statusInactive retired:sexmale
## Inactive care                                     0.3686044
## Inactive long term sick                           0.2498561
## Inactive other                                    0.4745720
## Inactive retired                                  0.1648817
## Inactive student                                  1.7316038
## Unemployed                                        0.3331755
##                         this_statusInactive student:sexmale
## Inactive care                                     0.4684662
## Inactive long term sick                           0.3535180
## Inactive other                                    0.2760982
## Inactive retired                                  1.1947170
## Inactive student                                  0.1054531
## Unemployed                                        0.1069839
##                         this_statusUnemployed:sexmale
## Inactive care                              0.20434462
## Inactive long term sick                    0.16038453
## Inactive other                             0.29385131
## Inactive retired                           0.19076501
## Inactive student                           0.16731369
## Unemployed                                 0.07832383
## 
## Residual Deviance: 99364.01 
## AIC: 99592.01
# A tibble: 12 × 10
     age sex    this_status Employed `Inactive care` `Inactive long term sick`
   <dbl> <chr>  <chr>          <dbl>           <dbl>                     <dbl>
 1    25 male   Employed       0.942         0.00159                   0.00211
 2    25 male   Unemployed     0.325         0.0260                    0.0375 
 3    25 female Employed       0.920         0.0264                    0.00218
 4    25 female Unemployed     0.262         0.251                     0.0447 
 5    40 male   Employed       0.964         0.00121                   0.00321
 6    40 male   Unemployed     0.405         0.0241                    0.0693 
 7    40 female Employed       0.951         0.0203                    0.00334
 8    40 female Unemployed     0.328         0.234                     0.0829 
 9    60 male   Employed       0.879         0.00126                   0.00677
10    60 male   Unemployed     0.273         0.0185                    0.108  
11    60 female Employed       0.844         0.0206                    0.00687
12    60 female Unemployed     0.199         0.162                     0.117  
# ℹ 4 more variables: `Inactive other` <dbl>, `Inactive retired` <dbl>,
#   `Inactive student` <dbl>, Unemployed <dbl>

Exposure Model 1: Clinical Depression


# Select variables to take from all wave-specific datasets

varnames <-  c(
  "jbstat", "dvage", "sex", "hcond17"
  )

vartypes <- c(
  "labels", "values", "labels", "labels"
  )

# Grab from first 11 waves 

df_ind_hconds <- get_ind_level_vars_for_selected_waves(varnames = varnames, vartypes = vartypes, waves = letters[1:11])

# Tidy dataset with Y_T+1, Y_T, X and Z

df_ind_hconds_tidied <- 
  df_ind_hconds |> 
    mutate(across(dvage, function(x) ifelse(x < 0, NA, x))) |> 
    mutate(across(hcond17, 
      function(x) {
        case_when(
          x == 'Mentioned' ~ TRUE,
          x == 'not mentioned' ~ FALSE,
          TRUE ~ NA
        )
      }
      )
    ) |> 
    rename(
      has_clinicaldepression = hcond17,
      age = dvage
    ) %>%
    filter(complete.cases(.)) 
  
# Run model with exposure term Z, has_clinicaldepression

mod_depression <- 
  nnet::multinom(
    next_status ~ this_status * sex + splines::bs(age, 5) + has_clinicaldepression,
    data = df_ind_hconds_tidied
  )
## # weights:  147 (120 variable)
## initial  value 88083.568807 
## iter  10 value 28309.304681
## iter  20 value 25694.480117
## iter  30 value 23022.479768
## iter  40 value 22334.449757
## iter  50 value 22129.818514
## iter  60 value 22088.844115
## iter  70 value 22077.766178
## iter  80 value 22067.862459
## iter  90 value 22022.759233
## iter 100 value 22007.391395
## final  value 22007.391395 
## stopped after 100 iterations


# Compare exposure model against foundational model
BIC(mod_00, mod_depression)
##                 df       BIC
## mod_00         114 100676.47
## mod_depression 120  45301.22
AIC(mod_00, mod_depression)
##                 df      AIC
## mod_00         114 99592.01
## mod_depression 120 44254.78

# If the exposure model 'beats' the foundational model (it does) then 
# create a baseline and counterfactual scenario dataset X_H0 and X_H1

# I'm going to use observed data from first wave for X_H0

df_baseline <-
  df_ind_hconds_tidied |> 
  filter(wave == 'a')


# For X_H1, I'm going to set everyone with clinical depression to no clinical 
# depression
df_counterfactual_depressaway <-
  df_baseline |> 
  mutate(has_clinicaldepression = FALSE)

# For X_H0, we can get the probability of each person moving to each possible state
# using the predict function on the df_baseline dataset, setting type to 'probs'
preds_df_baseline <- 
  predict(mod_depression, newdata = df_baseline, type = "probs")

# For X_H1, we can do the same, but with the counterfactual dataset X_H1
preds_df_counter <- 
  predict(mod_depression, newdata = df_counterfactual_depressaway, type = "probs")

# A bit of base R 'magic' to add up the total number of people predicted to be in each state under both scenarios

predictions_summary_matrix <- cbind(
  # The number 2 indicates do the sum function for each column.
  # If it were 1 then this would sum for each row (which should add up to 1 in call cases)
  apply(preds_df_baseline, 2, sum),
  apply(preds_df_counter, 2, sum)
)

colnames(predictions_summary_matrix) <- c("base", "counterfactual")
predictions_summary_matrix
##                               base counterfactual
## Employed                19851.4570     19957.7329
## Inactive care            2592.5688      2640.7475
## Inactive long term sick  1440.9549      1294.5141
## Inactive other            165.3931       159.8095
## Inactive retired         8443.1883      8458.3688
## Inactive student         1889.9852      1885.0025
## Unemployed               2019.4527      2006.8245

# And we can also calculate the % change in each state when going from the baseline
# to counterfactual scenario
sim_relative_change <- apply(
    predictions_summary_matrix, 1, function(x) (100 * x / x[1])
  ) |> 
  t()

sim_relative_change
##                         base counterfactual
## Employed                 100      100.53536
## Inactive care            100      101.85834
## Inactive long term sick  100       89.83724
## Inactive other           100       96.62405
## Inactive retired         100      100.17980
## Inactive student         100       99.73637
## Unemployed               100       99.37467

Predictions/Projections

                              base counterfactual
Employed                19851.4570     19957.7329
Inactive care            2592.5688      2640.7475
Inactive long term sick  1440.9549      1294.5141
Inactive other            165.3931       159.8095
Inactive retired         8443.1883      8458.3688
Inactive student         1889.9852      1885.0025
Unemployed               2019.4527      2006.8245
                        base counterfactual
Employed                 100      100.53536
Inactive care            100      101.85834
Inactive long term sick  100       89.83724
Inactive other           100       96.62405
Inactive retired         100      100.17980
Inactive student         100       99.73637
Unemployed               100       99.37467
  • “If the effects of clinical depression on economic inactivity were fully mitigated, then the size of the economically inactive long-term sick population could be reduced by around a tenth.”1

  • “Around a tenth of economic inactivity due to long-term sickness is explained by clinical depression.”2

Second example: Improving health in general

Table 1: Effect of moving MH and PH by 1 SD collectively
State base counterfactual Absolute Change Relative Change
Employed 12530 12827 297 2.4% up
Unemployed 619 524 -95 15.3% down
Inactive student 88 93 5 5.7% up
Inactive care 857 844 -13 1.5% down
Inactive long term sick 681 478 -203 29.8% down
Inactive retired 520 528 8 1.5% up
Inactive other 66 65 -1 1.5% down

General framework

flowchart TB
  D[Data]
  M0(Foundational Model)
  M1(Exposure Model)
  Tx{Compare Models}
  H0[Baseline]
  H1[Counterfactual]
  I((Imagination))
  Y0(Baseline Results)
  Y1(Counterfactual Results)
  Yx{Compare Results}
  Rx[Summarised differences]
  
  D -->|fit|M0
  D -->|fit|M1
  M0 -->|AIC/BIC|Tx
  M1 -->|AIC/BIC|Tx
  Tx -->|validate|M1
  D -->|subset|H0
  H0 -->|modify|H1
  I -->|make up|H0
  I -->|make up|H1
  H0 --> M1 --> Y0
  H1 --> M1 --> Y1
  Y0 --> Yx
  Y1 --> Yx
  Yx -->|report| Rx
  
  

Concluding thoughts

  • Simple question, complicated methods
  • LM vs GLM
  • Many ‘heroic assumptions’
  • Many potential refinements
  • Many potential applications

Contact

  • jon.will.minton@gmail.com
  • jon.minton@phs.scot
  • @jonminton on the X social media platform
  • Website: https://jonminton.net

Thanks for listening!