Satisfy loss function: Find values of \(\beta\) such that discrepency between \(y_i\) (observed response) and \(Y_i\) (predicted response given \(X_i\), observed predictors) is minimised.
Statisticians usually use likelihood theory to justify the loss function. ‘Data scientists’ can be a bit wilder. (e.g. RMSE on out-of-sample test data)
Once the best model parameters \(\beta\) have been identified for our model \(M\), we can now swap out observed predictors \(X_i\) for hypothetical predictors \(X^{(H)}\) to get conditional predictions \(Y | X^{(H)}\)
Let’s split the predictors \(X\) into \((X^*, Z)\), where \(X^*\) is variables we aren’t interested in but need to control for, and \(Z\) are the exposure variables for which we are interested in modelling the causal effects on \(Y\).
\(H_1 := (X^*, Z=1)\) : The exposure is ‘on’
\(H_0 := (X^*, Z=0)\) : The exposure is ‘off’
Comparison of \(Y|H_1\) and \(Y|H_0\) gives an effect estimate of \(Z\) on \(Y\), i.e. the amount that \(Y\) changes as a result of \(Z\), holding all else constant.
In practice:
Many exposures are not either ‘on’ or ‘off’, but continuous
Rather than \(H_1\) being \(Z=1\) for all observations, we might use the observed values of \(Z\) from an indicative dataset. (Not everyone smokes/ has diabetes etc)
We can use the same GLM ‘chassis’ to model the influence of history.
Swap out \(Y\) for \(Y_{T+1}\) on the response side
Include \(Y_T\) on the predictor side
If Y is continuous we have conventional time-series modelling
If Y is discrete we have Markov modelling (broadly speaking)
we can observe the same individuals transitioning between discrete states over time
there are adequate demographic controls
there are driver/exposure variables of interest
I have used UK Household Longitudinal Study (UKHLS), which grew out of the British Household Panel Survey (BHPS).
A model that adequately controls for path dependency \(Y_T\) and demography \(X^*\)
Different specifications can do this. I used AIC & BIC to decide between different possible specifications
A model that incrementally builds on the Foundational model to include at least one exposure variable as well.
Decision guided both by substantive interest in a given exposure, and whether AIC and BIC suggest it is ‘better’ than the Foundational model
Absent clear epidemiological knowledge otherwise, continuous exposures are standardised then, in the counterfactual scenario, moved one standardised unit in the ‘good’ direction.
multinomial logit extends from two mutually exclusive states to K mutually exclusive states
implemented using mlogit function in nnet package
more computationally intensive but not different conceptually
# A tibble: 99,989 × 6
pidp wave age sex this_status next_status
<dbl> <chr> <dbl> <chr> <chr> <chr>
1 68004087 a 59 male Employed Employed
2 68006127 a 39 female Unemployed Unemployed
3 68007487 a 57 female Unemployed Unemployed
4 68008167 a 38 female Inactive care Inactive long term sick
5 68008171 a 51 male Employed Employed
6 68008847 a 51 female Employed Employed
7 68009527 a 31 male Employed Unemployed
8 68010887 a 45 female Employed Employed
9 68011567 a 35 male Employed Employed
10 68014287 a 39 female Inactive care Inactive care
# ℹ 99,979 more rows
mod_00 <- nnet::multinom( next_status ~ this_status * sex + splines::bs(age, 5),data = df_ind )## # weights: 140 (114 variable)## initial value 194569.609894 ## iter 10 value 54214.629019## iter 20 value 51957.007618## iter 30 value 50975.093843## iter 40 value 49988.305916## iter 50 value 49844.469018## iter 60 value 49757.619656## iter 70 value 49717.329389## iter 80 value 49701.151713## iter 90 value 49689.112117## iter 100 value 49682.003832## final value 49682.003832 ## stopped after 100 iterationssummary(mod_00)## Call:## nnet::multinom(formula = next_status ~ this_status * sex + splines::bs(age, ## 5), data = df_ind)## ## Coefficients:## (Intercept) this_statusInactive care## Inactive care -4.264675 5.580379## Inactive long term sick -6.353690 4.081220## Inactive other -5.150505 3.780599## Inactive retired -8.785411 3.571897## Inactive student -1.600841 1.845938## Unemployed -2.908889 3.393434## this_statusInactive long term sick## Inactive care 4.540475## Inactive long term sick 8.322309## Inactive other 4.208462## Inactive retired 4.638818## Inactive student 2.661586## Unemployed 4.568436## this_statusInactive other this_statusInactive retired## Inactive care 3.647580 4.035039## Inactive long term sick 3.578256 4.407072## Inactive other 5.826705 3.439706## Inactive retired 2.944156 5.499531## Inactive student 2.771293 2.972128## Unemployed 2.873311 2.178185## this_statusInactive student this_statusUnemployed## Inactive care 1.299945 3.509757## Inactive long term sick 2.168923 4.275870## Inactive other 2.713042 2.963995## Inactive retired 1.265626 2.411328## Inactive student 3.715771 1.905086## Unemployed 2.147735 3.973331## sexmale splines::bs(age, 5)1 splines::bs(age, 5)2## Inactive care -2.8303907 0.8112421 0.9099609## Inactive long term sick -0.0556538 0.2559587 0.4892264## Inactive other 0.2073837 -1.5827796 -1.1475237## Inactive retired -0.2872212 -0.5234163 -2.5719178## Inactive student -0.3962258 -1.8118528 -3.1288966## Unemployed 0.3195387 -0.6178169 -1.0568732## splines::bs(age, 5)3 splines::bs(age, 5)4## Inactive care -0.1647709 0.8361373## Inactive long term sick 0.9226743 1.9611845## Inactive other -1.0906306 -0.4001185## Inactive retired 3.5484489 6.2454401## Inactive student -3.9914465 -5.0954977## Unemployed -1.1154563 -0.5016708## splines::bs(age, 5)5 this_statusInactive care:sexmale## Inactive care 0.3569390 2.5082715## Inactive long term sick 1.1868514 0.2302962## Inactive other -0.5552339 1.2885666## Inactive retired 7.9584280 -0.3725108## Inactive student -6.4353343 -0.3348881## Unemployed -1.4102953 0.4964148## this_statusInactive long term sick:sexmale## Inactive care 0.42725912## Inactive long term sick -0.08758388## Inactive other -0.59378873## Inactive retired -0.35783551## Inactive student 0.16734681## Unemployed -0.21685831## this_statusInactive other:sexmale## Inactive care 1.62474870## Inactive long term sick 0.26089140## Inactive other -0.82724020## Inactive retired -0.22296908## Inactive student 0.15768088## Unemployed -0.03014452## this_statusInactive retired:sexmale## Inactive care 0.50020854## Inactive long term sick 0.52587521## Inactive other 0.09880642## Inactive retired -0.05111257## Inactive student -1.66491062## Unemployed 0.13247450## this_statusInactive student:sexmale## Inactive care 0.56153093## Inactive long term sick 0.31393653## Inactive other 0.03498304## Inactive retired -0.15240489## Inactive student 0.39807143## Unemployed 0.07827380## this_statusUnemployed:sexmale## Inactive care 0.34742950## Inactive long term sick -0.33434819## Inactive other -0.65373515## Inactive retired -0.25361540## Inactive student 0.04357869## Unemployed -0.15493257## ## Std. Errors:## (Intercept) this_statusInactive care## Inactive care 0.16248654 0.05309671## Inactive long term sick 0.26073990 0.11645678## Inactive other 0.23701306 0.17052665## Inactive retired 1.71307219 0.09996639## Inactive student 0.08087998 0.14558880## Unemployed 0.08849234 0.06684793## this_statusInactive long term sick## Inactive care 0.1366876## Inactive long term sick 0.1396415## Inactive other 0.3193084## Inactive retired 0.1541554## Inactive student 0.3425172## Unemployed 0.1363659## this_statusInactive other this_statusInactive retired## Inactive care 0.1592756 0.1466636## Inactive long term sick 0.2969801 0.1883779## Inactive other 0.1990749 0.3651422## Inactive retired 0.2661224 0.1136774## Inactive student 0.2491562 0.6784096## Unemployed 0.1965656 0.2593636## this_statusInactive student this_statusUnemployed## Inactive care 0.13446168 0.06609202## Inactive long term sick 0.25346504 0.11472025## Inactive other 0.22623500 0.21334522## Inactive retired 0.71542075 0.15317899## Inactive student 0.07271687 0.11849251## Unemployed 0.08492230 0.06098611## sexmale splines::bs(age, 5)1 splines::bs(age, 5)2## Inactive care 0.15989693 0.2591058 0.1585449## Inactive long term sick 0.12231001 0.4075431 0.2506371## Inactive other 0.16255927 0.4041402 0.3241556## Inactive retired 0.06643554 3.0747685 1.7525551## Inactive student 0.08630838 0.1423782 0.1863921## Unemployed 0.05046323 0.1494099 0.1097848## splines::bs(age, 5)3 splines::bs(age, 5)4## Inactive care 0.2210071 0.1878868## Inactive long term sick 0.3147839 0.2646640## Inactive other 0.3677834 0.3320895## Inactive retired 1.7793310 1.7047936## Inactive student 0.2968194 0.5222790## Unemployed 0.1403807 0.1309305## splines::bs(age, 5)5 this_statusInactive care:sexmale## Inactive care 0.2237352 0.2332628## Inactive long term sick 0.2865589 0.3279407## Inactive other 0.3618112 0.3354850## Inactive retired 1.7159605 0.3105845## Inactive student 1.0496550 1.0183519## Unemployed 0.1622973 0.2073871## this_statusInactive long term sick:sexmale## Inactive care 0.3333645## Inactive long term sick 0.1970728## Inactive other 0.4744219## Inactive retired 0.2076420## Inactive student 0.5084220## Unemployed 0.1878839## this_statusInactive other:sexmale## Inactive care 0.3286156## Inactive long term sick 0.4137753## Inactive other 0.3012812## Inactive retired 0.4079154## Inactive student 0.3588901## Unemployed 0.2683190## this_statusInactive retired:sexmale## Inactive care 0.3686044## Inactive long term sick 0.2498561## Inactive other 0.4745720## Inactive retired 0.1648817## Inactive student 1.7316038## Unemployed 0.3331755## this_statusInactive student:sexmale## Inactive care 0.4684662## Inactive long term sick 0.3535180## Inactive other 0.2760982## Inactive retired 1.1947170## Inactive student 0.1054531## Unemployed 0.1069839## this_statusUnemployed:sexmale## Inactive care 0.20434462## Inactive long term sick 0.16038453## Inactive other 0.29385131## Inactive retired 0.19076501## Inactive student 0.16731369## Unemployed 0.07832383## ## Residual Deviance: 99364.01 ## AIC: 99592.01
# Select variables to take from all wave-specific datasetsvarnames <-c("jbstat", "dvage", "sex", "hcond17" )vartypes <-c("labels", "values", "labels", "labels" )# Grab from first 11 waves df_ind_hconds <-get_ind_level_vars_for_selected_waves(varnames = varnames, vartypes = vartypes, waves = letters[1:11])# Tidy dataset with Y_T+1, Y_T, X and Zdf_ind_hconds_tidied <- df_ind_hconds |>mutate(across(dvage, function(x) ifelse(x <0, NA, x))) |>mutate(across(hcond17, function(x) {case_when( x =='Mentioned'~TRUE, x =='not mentioned'~FALSE,TRUE~NA ) } ) ) |>rename(has_clinicaldepression = hcond17,age = dvage ) %>%filter(complete.cases(.)) # Run model with exposure term Z, has_clinicaldepressionmod_depression <- nnet::multinom( next_status ~ this_status * sex + splines::bs(age, 5) + has_clinicaldepression,data = df_ind_hconds_tidied )## # weights: 147 (120 variable)## initial value 88083.568807 ## iter 10 value 28309.304681## iter 20 value 25694.480117## iter 30 value 23022.479768## iter 40 value 22334.449757## iter 50 value 22129.818514## iter 60 value 22088.844115## iter 70 value 22077.766178## iter 80 value 22067.862459## iter 90 value 22022.759233## iter 100 value 22007.391395## final value 22007.391395 ## stopped after 100 iterations# Compare exposure model against foundational modelBIC(mod_00, mod_depression)## df BIC## mod_00 114 100676.47## mod_depression 120 45301.22AIC(mod_00, mod_depression)## df AIC## mod_00 114 99592.01## mod_depression 120 44254.78# If the exposure model 'beats' the foundational model (it does) then # create a baseline and counterfactual scenario dataset X_H0 and X_H1# I'm going to use observed data from first wave for X_H0df_baseline <- df_ind_hconds_tidied |>filter(wave =='a')# For X_H1, I'm going to set everyone with clinical depression to no clinical # depressiondf_counterfactual_depressaway <- df_baseline |>mutate(has_clinicaldepression =FALSE)# For X_H0, we can get the probability of each person moving to each possible state# using the predict function on the df_baseline dataset, setting type to 'probs'preds_df_baseline <-predict(mod_depression, newdata = df_baseline, type ="probs")# For X_H1, we can do the same, but with the counterfactual dataset X_H1preds_df_counter <-predict(mod_depression, newdata = df_counterfactual_depressaway, type ="probs")# A bit of base R 'magic' to add up the total number of people predicted to be in each state under both scenariospredictions_summary_matrix <-cbind(# The number 2 indicates do the sum function for each column.# If it were 1 then this would sum for each row (which should add up to 1 in call cases)apply(preds_df_baseline, 2, sum),apply(preds_df_counter, 2, sum))colnames(predictions_summary_matrix) <-c("base", "counterfactual")predictions_summary_matrix## base counterfactual## Employed 19851.4570 19957.7329## Inactive care 2592.5688 2640.7475## Inactive long term sick 1440.9549 1294.5141## Inactive other 165.3931 159.8095## Inactive retired 8443.1883 8458.3688## Inactive student 1889.9852 1885.0025## Unemployed 2019.4527 2006.8245# And we can also calculate the % change in each state when going from the baseline# to counterfactual scenariosim_relative_change <-apply( predictions_summary_matrix, 1, function(x) (100* x / x[1]) ) |>t()sim_relative_change## base counterfactual## Employed 100 100.53536## Inactive care 100 101.85834## Inactive long term sick 100 89.83724## Inactive other 100 96.62405## Inactive retired 100 100.17980## Inactive student 100 99.73637## Unemployed 100 99.37467
base counterfactual
Employed 19851.4570 19957.7329
Inactive care 2592.5688 2640.7475
Inactive long term sick 1440.9549 1294.5141
Inactive other 165.3931 159.8095
Inactive retired 8443.1883 8458.3688
Inactive student 1889.9852 1885.0025
Unemployed 2019.4527 2006.8245
base counterfactual
Employed 100 100.53536
Inactive care 100 101.85834
Inactive long term sick 100 89.83724
Inactive other 100 96.62405
Inactive retired 100 100.17980
Inactive student 100 99.73637
Unemployed 100 99.37467
“If the effects of clinical depression on economic inactivity were fully mitigated, then the size of the economically inactive long-term sick population could be reduced by around a tenth.”1
“Around a tenth of economic inactivity due to long-term sickness is explained by clinical depression.”2
Second example: Improving health in general
Table 1: Effect of moving MH and PH by 1 SD collectively