GBD training, July 2025, IHME, University of Washington

Key takeaways

– Crosswalking is the process of adjusting data for known biases.

– R users: To access the CrossWalk package at IHME, see https://hub.ihme.washington.edu/x/aQtzCw

– cw$CWData() formats the data, cw$CWModel() runs the model, and the fitted model's adjust_orig_vals() method applies the adjustment

– Reproducible examples are available in the Health Metrics Toolbox: https://scicomp-docs.ihme.washington.edu/msca/book/current/

– Thank you Sasha Aravkin and Peng Zheng for creating the package!

Official documentation

Overview (session 1)

Motivating examples

Steps

Brief review of prerequisites

-- Probability distributions

-- Linear mixed effects models

Math formulas

Code

Interpreting results

When would I use this?

Any time available data does not match the standard GBD definition of the modeled parameter

Examples from various GBD models

  Alternative                                            | Reference
  Diagnostic test with unknown sensitivity/specificity   | High sensitivity/specificity
  Hospital cases of chronic disease                      | Population-based survey
  Acute otitis media                                     | Chronic otitis media
  One-year recall period                                 | One-month recall period
  Self-reported height and weight                        | Measured height and weight

Also use when conducting a network meta-analysis of relative risks (covered in the next session)

Motivating example [1]

Diagnostic test A has sensitivity and specificity of 100 percent; it serves as the GBD reference

Diagnostic test B has a sensitivity of 50 percent (and a specificity of 100 percent)

  • Among all true cases, only 50 percent register as positive according to the test

Observed incidence with test B: 1 case per person-year

What should the adjusted incidence be?
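
A quick arithmetic check (a sketch assuming only sensitivity differs and specificity is perfect, so observed = true × sensitivity):

observed_incidence <- 1      # cases per person-year, measured with test B
sensitivity_B <- 0.5
observed_incidence / sensitivity_B  # adjusted incidence: 2 cases per person-year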

Motivating example [2]

Motivating example [3]

Steps

  1. Find matched pairs of observations from the same population, different measurement methods

  2. Transform all observations into the appropriate probability space (log or logit) using the delta method

  3. By pair, take the difference (“alt minus ref”) and calculate the standard error of the difference

  4. Use these differences as the dependent variable in a mixed effects meta-regression

  5. Use the fitted model to predict the adjustment needed for a non-reference definition or method, and apply the adjustment

Matching by age group

Matched pairs for population-level data

As a default, strict matching by location, sex, GBD age group, and year

Typically, first conduct age-sex splitting to convert into standard GBD categories

  • If change over time is small, can relax time matching criteria to 5-year intervals, using the interval midpoint for multi-year observations (see the sketch below)

  • If age pattern is flat (rare), can match short age intervals with similar midpoints or use custom overlap criteria

Tradeoff between strictness of matching criteria and number of observations
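
For instance, a minimal sketch of 5-year time binning by interval midpoint (df, year_start, and year_end are assumed names, not from the example data):

library(dplyr)
df <- df %>%
  mutate(
    year_mid = (year_start + year_end) / 2,  # midpoint of a multi-year observation
    year_bin = cut(year_mid, breaks = seq(1980, 2025, by = 5), right = FALSE)
  )
# then paste year_bin into match_id alongside location, sex, and age group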

Matched pairs for individual-level data

The ideal case; matched pairs come for free, since the same individuals are measured with both methods.

Common example:

  • Within one cohort study, multiple diagnostic tests for the same person

Search for validation studies!

Individual-level data don’t have standard errors

  • But sometimes we aggregate when using multiple studies

What is a standard error?

A standard error is the standard deviation of an estimate's sampling distribution, i.e. how much the estimate would vary across repeated samples.

  • Higher sample size \(\rightarrow\) less uncertainty \(\rightarrow\) smaller SE

  • For population-level data, need to calculate a standard error for each observation

A meta-regression model weights observations by inverse variance (\(1 / \sigma^2_i\))
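
For intuition, a two-observation sketch of inverse-variance weighting (illustrative numbers):

se <- c(0.10, 0.05)   # standard errors of two observations
w <- 1 / se^2         # inverse-variance weights
round(w / sum(w), 2)  # 0.2 and 0.8: halving the SE quadruples the weight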

Steps (with math) [1]

Step 1. Find matched pairs of observations from the same population, different measurement methods

Step 2. Transform all observations into the appropriate probability space using the delta method

  • Log transformation for \([0,\infty)\) parameters, like incidence, mortality rates (mx), and remission

  • Logit transformation for \([0,1]\) parameters, like prevalence or a proportion
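
Concretely, the first-order delta method behind these transforms (standard approximations; the same numbers reappear in the linear_to_log() output later):

mu <- 0.22; se <- 0.10
# log: d/dx log(x) = 1/x, so SE(log(mu)) ~= se / mu
c(log(mu), se / mu)                          # -1.514, 0.455
# logit: d/dx logit(x) = 1/(x(1-x)), so SE(logit(mu)) ~= se / (mu * (1 - mu))
c(log(mu / (1 - mu)), se / (mu * (1 - mu)))  # -1.266, 0.583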

Steps (with math) [2]

Step 3. By pair, take the difference (“alt minus ref”) and calculate the standard error of the difference

\(y_i = log(\mu_i^{alt}) - log(\mu_i^{ref})\)

Step 4. Use these differences as the dependent variable in a mixed effects meta-regression; \(E[y_i]=\beta_0\)

Step 5. Use the fitted model to predict the adjustment needed for a non-reference definition or method, and apply the adjustment

\(\hat{\delta} = \hat{\beta}_0\)

\(\mu^{adjusted}_i = exp(log(\mu^{alt}_i) - \hat{\delta} )\)

Data prep [1] – original data

library(dplyr)
df_orig <- df_tmp %>%
  mutate(
    age_cat = cut(age, breaks = seq(0, 100, by = 5), right = FALSE),
    match_id = paste0(age_cat), # include other matching variables here
    orig_row = 1:nrow(.)
  )

df_orig %>% # just for showing the data frame
  filter(age_cat == "[0,5)") %>%
  select(age, obs_method, incidence1_mean, incidence1_se)
##   age   obs_method incidence1_mean incidence1_se
## 1   2 Diagnostic_B        1.951142      1.396833
## 2   4 Diagnostic_A        3.146722      1.773900

Data prep [2] – example code for matching

method_var <- "obs_method"
gold_def <- "Diagnostic_A"
keep_vars <- c("orig_row", "age", "incidence1_mean", "incidence1_se")

df_matched <- do.call("rbind", lapply(unique(df_orig$match_id), function(i) {
  dat_i <- filter(df_orig, match_id == i) %>% mutate(dorm = get(method_var))
  keep_vars <- c("dorm", keep_vars)
  row_ids <- expand.grid(idx1 = 1:nrow(dat_i), idx2 = 1:nrow(dat_i))
  do.call("rbind", lapply(1:nrow(row_ids), function(j) {
    dat_j <- row_ids[j, ] 
    dat_j[, paste0(keep_vars, "_alt")] <- dat_i[dat_j$idx1, keep_vars]
    dat_j[, paste0(keep_vars, "_ref")] <- dat_i[dat_j$idx2, keep_vars]
    filter(dat_j, dorm_alt != gold_def & dorm_alt != dorm_ref)
  })) %>% mutate(id = i) %>% select(-idx1, -idx2)
}))
# use only for 1 alternative definition
# if running this with multiple alts, remove duplicate indirect comparisons

Data prep [3] – matched data

# to use a grouping variable as a covariate, take the average
# -- not applicable to session 1 (intercept-only model)
df_matched$age <- (df_matched$age_ref + df_matched$age_alt) / 2

# let's check on the result
filter(df_matched, id == "[0,5)")
##       dorm_alt orig_row_alt age_alt incidence1_mean_alt incidence1_se_alt
## 1 Diagnostic_B            1       2            1.951142          1.396833
##       dorm_ref orig_row_ref age_ref incidence1_mean_ref incidence1_se_ref    id
## 1 Diagnostic_A            2       4            3.146722            1.7739 [0,5)
##   age
## 1   3

The delta method [1]

Use the delta method to convert standard errors into/out of log/logit space

  • Note that the delta_transform() function has been deprecated. Use cw$utils$linear_to_log() and cw$utils$linear_to_logit() instead
library(reticulate)
reticulate::use_python(path_to_python) # path to a Python environment with the crosswalk package installed
cw <- reticulate::import("crosswalk")
cw$utils$linear_to_log(mean = array(c(0.22, 0.11)), sd = array(c(0.1, 0.1)))
## [[1]]
## [1] -1.514128 -2.207275
## 
## [[2]]
## [1] 0.4545455 0.9090909

The delta method [2] - handling zeros

Observations with \(\mu=0\) cannot be log or logit transformed

  • Discuss with your team lead; do not automatically remove

  • One option is to offset zeros with half the lowest observed non-zero value; alternatively, 1 percent of the median of non-zero values

  • Do not offset zeros using an extremely low value, e.g. 0.000001

Similarly, \(\mu=1\) cannot be logit transformed
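
As an illustration of the half-minimum offset (a sketch with made-up values; confirm the choice with your team lead):

x <- c(0, 0.03, 0.12, 0.40)        # illustrative observed means
offset_val <- 0.5 * min(x[x > 0])  # half the lowest non-zero value: 0.015
# alternative: 0.01 * median(x[x > 0])
x[x == 0] <- offset_val
x                                  # 0.015 0.030 0.120 0.400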

Creating the dependent variable

\(y_i = log(\mu_i^{alt}) - log(\mu_i^{ref})\)

\(Var(y_i) = Var(log(\mu_i^{alt})) + Var(log(\mu_i^{ref}))\)

df_matched[, c("ref_mean2", "ref_se2")] <- cw$utils$linear_to_log(
  mean = array(df_matched$incidence1_mean_ref), 
  sd = array(df_matched$incidence1_se_ref)
)
df_matched[, c("alt_mean2", "alt_se2")] <- cw$utils$linear_to_log(
  mean = array(df_matched$incidence1_mean_alt), 
  sd = array(df_matched$incidence1_se_alt)
)

df_matched2 <- df_matched %>%
  mutate(
    ydiff_log_mean = alt_mean2 - ref_mean2,
    ydiff_log_se = sqrt(alt_se2^2 + ref_se2^2)
  )

Run the crosswalk model

dat1 <- cw$CWData(
  df = df_matched2,
  obs = "ydiff_log_mean",   # matched differences in log space
  obs_se = "ydiff_log_se",  # SE of matched differences in log space
  alt_dorms = "dorm_alt",   # var for the alternative def/method
  ref_dorms = "dorm_ref",   # var for the reference def/method
  covs = list(),            # list of (potential) covariate column names
  study_id = "id"          # var for random intercepts; i.e. (1|study_id)
)

fit1 <- cw$CWModel(
  cwdata = dat1,           # result of CWData() function call
  obs_type = "diff_log",   # must be "diff_logit" or "diff_log"
  cov_models = list(       # specify covariate details
    cw$CovModel("intercept") ),
  gold_dorm = "Diagnostic_A"  # level of 'ref_dorms' that's the gold standard
)
fit1$fit()

Apply the adjustment [1]

pred_tmp <- fit1$adjust_orig_vals(
  df = df_orig,
  orig_dorms = "obs_method",
  orig_vals_mean = "incidence1_mean", # mean in un-transformed space
  orig_vals_se = "incidence1_se"      # SE in un-transformed space
)
head(pred_tmp)
##   ref_vals_mean ref_vals_sd pred_diff_mean pred_diff_sd data_id
## 1      3.590443    2.635479     -0.6098604     0.162093       0
## 2      3.146722    1.773900      0.0000000     0.000000       1
## 3      6.060489    3.481005     -0.6098604     0.162093       2
## 4      3.614681    1.901232      0.0000000     0.000000       3
## 5      2.863171    1.692091      0.0000000     0.000000       4
## 6      4.148718    2.843693     -0.6098604     0.162093       5

Apply the adjustment [2]

The output of adjust_orig_vals() is:

  • Mean and standard error (linear space) of the adjusted observation,

  • Mean and standard error (transformed space) of predicted adjustment, and

  • A data_id column to guarantee consistency of the row order.

For zeros with alternative definitions in the original data, keep at zero and inflate the uncertainty

Propagating uncertainty

SE of an adjusted data point includes:

– \(\sigma_i\): standard error of the original data point

– \(\sigma_{\hat{p}}\): standard error of the predicted adjustment factor

– \(\gamma\): variance of between-group heterogeneity

\(\sigma_i^* = \sqrt{\sigma_i^2 + \sigma_{\hat{p}}^2 + \gamma}\)

Original data

Final result

Overview (session 2)

Funnel plots

Linear covariates and splines

Dose response plots

Reporting results

Network meta-analysis

Calculating an adjustment by hand

Funnel plot example [1]

Funnel plot example [2]

Points are observations

Y-axis is the observation standard error (SE, or \(\sigma_i\))

  • Higher on y-axis = lower SE = more weight in the model

X-axis is the observation mean

Funnel:

  • Inside the funnel means the observation mean is between \(\hat{\delta}-1.96\sigma_i\) and \(\hat{\delta}+1.96\sigma_i\)

  • We expect approximately 5 percent of observations outside the funnel, in the absence of between-study heterogeneity

Funnel plot example [3]

plots <- import("crosswalk.plots")

plots$funnel_plot(
  cwmodel = fit1, 
  cwdata = dat1,
  continuous_variables = list(),
  obs_method = 'Diagnostic_B',
  plot_note = 'Funnel plot example', 
  plots_dir = file.path(getwd(), "plots"), 
  file_name = "funnel_plot_example_v1",
  write_file = TRUE
)

See py_help(plots$funnel_plot) for the Python docstring

Adding a spline covariate [1]

dat2 <- cw$CWData(
  df = df_matched2, obs = "ydiff_log_mean", obs_se = "ydiff_log_se",  
  alt_dorms = "dorm_alt", ref_dorms = "dorm_ref", 
  covs = list("age"), study_id = "id" )

fit2 <- cw$CWModel( 
  cwdata = dat2, obs_type = "diff_log", 
  cov_models = list(       
    cw$CovModel("intercept"),
    cw$CovModel(cov_name = "age",
      spline = cw$XSpline(
        knots = c(min(df_matched2$age), 33, 66, max(df_matched2$age)),
        degree = 2L, l_linear = TRUE, r_linear = FALSE)
    ) ),
  gold_dorm = "Diagnostic_A"
)
fit2$fit()

Adding a spline covariate [2]

Don’t try to interpret the \(\beta\) coefficients of a spline!

fit2$create_result_df() %>%
  select(dorms, cov_names, beta, beta_sd, gamma)
##          dorms    cov_names       beta   beta_sd gamma
## 1 Diagnostic_A    intercept  0.0000000 0.0000000     0
## 2 Diagnostic_A age_spline_0  0.0000000 0.0000000   NaN
## 3 Diagnostic_A age_spline_1  0.0000000 0.0000000   NaN
## 4 Diagnostic_A age_spline_2  0.0000000 0.0000000   NaN
## 5 Diagnostic_B    intercept -0.7246240 0.2328387   NaN
## 6 Diagnostic_B age_spline_0 -0.2202743 0.2208264   NaN
## 7 Diagnostic_B age_spline_1  0.4222690 0.9522521   NaN
## 8 Diagnostic_B age_spline_2 -0.4120475 1.1877691   NaN

Dose-response plots [1]

plots <- import("crosswalk.plots")

plots$dose_response_curve(
  dose_variable = 'age',
  obs_method = 'Diagnostic_B', 
  continuous_variables = list(), 
  cwdata = dat2, 
  cwmodel = fit2, 
  plot_note = "Example dose-response plot", 
  plots_dir = file.path(getwd(), "plots/"), 
  file_name = "doseresponse_plot_example_v1", 
  write_file = TRUE
)

Dose-response plots [2]

Plot arguments [3]

Arguments for both funnel_plot and dose_response_curve

  • obs_method: The alternative definition (e.g. “Diagnostic_B”) for which we want to see the effect size

When other covariates are present in the model, a reference level of 0 is assumed unless…

  • continuous_variables: variables listed here will use the median value rather than 0

  • binary_variables: a named list (names are covariate names) with values of 0, 1, “mean”, or “median”

Note: data in the plots are adjusted to the reference level!

Reporting results [1]

Use create_result_df() to get a table of estimated coefficients, including gamma (variance of between-group heterogeneity)

  • But don’t report or try to interpret spline betas!

Use dose-response plots to show non-linear effects.

Reporting results [2]

For intercepts, binary covariates, and linear covariates:

  • For log-scale models, exponentiate betas to obtain relative risks (since \(log(a/b) = log(a) - log(b)\), a log-scale difference exponentiates to a ratio)

  • For logit-scale models, exponentiate (not inverse logit!) betas to obtain odds ratios

Another name for logit is log-odds (i.e. \(log(\mu/(1-\mu))\))

\(OR = odds_{alt}/odds_{ref} = exp(log(odds_{alt}) - log(odds_{ref}))\)

For standard errors, use the delta method with cw$utils$log_to_linear or cw$utils$logit_to_linear
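
For example, with the intercept-only fit from session 1 (a sketch using the create_result_df() columns shown earlier):

res <- fit1$create_result_df()
b <- res[res$dorms == "Diagnostic_B" & res$cov_names == "intercept", ]
exp(b$beta)  # log-scale model, so this is a relative risk (~0.543 for Diagnostic_B)
# delta method for the corresponding linear-space mean and SE
cw$utils$log_to_linear(mean = array(b$beta), sd = array(b$beta_sd))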

Network meta-analysis, introduction

Simulated example (7.2) from Health Metrics Toolbox

##   altvar refvar logit_diff logit_diff_se group_id
## 1      C      B   4.824921     0.5524550        1
## 2      B      C  -3.309311     0.4719643        1
## 3      C      B   4.035348     0.5238145        1
## 4      B      A  18.119626     0.5445521        1
## 5      C      B  10.743544     0.5142872        1
## 6      C      A  24.837169     0.5588720        1

The variables we pass to the alt_dorms and ref_dorms arguments of cw$CWData() can now take on multiple levels.

Otherwise exactly the same as before.
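
A sketch of the corresponding calls (assuming the printed rows sit in a data frame df_net, and taking A as the gold standard for illustration):

dat3 <- cw$CWData(
  df = df_net,
  obs = "logit_diff", obs_se = "logit_diff_se",
  alt_dorms = "altvar", ref_dorms = "refvar",
  covs = list(), study_id = "group_id"
)
fit3 <- cw$CWModel(
  cwdata = dat3, obs_type = "diff_logit",
  cov_models = list(cw$CovModel("intercept")),
  gold_dorm = "A"
)
fit3$fit()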

Network meta-analysis with composite definitions [1]

##   id alt ref logit_prev_diff logit_prev_diff_se
## 1  2 C_D   A      0.54825365                0.5
## 2  4   B   A      0.82377099                0.5
## 3  5   C   B     -0.63901710                0.5
## 4  6   B   A     -0.03491371                0.5
## 5  7   C C_D     -0.54588728                0.5
## 6  8 B_C   A      0.52017736                0.5

Network meta-analysis with composite definitions [2]

dat4 <- cw$CWData(
  df = df4,
  obs = "logit_prev_diff", obs_se = "logit_prev_diff_se",
  alt_dorms = "alt", ref_dorms = "ref",
  dorm_separator = "_",
  covs = list(), study_id = "id"
)

dorm_separator = "_"

Note: the default is " ", so be careful using space characters in your dorms!

Network meta-analysis with an order prior [1]

[Daily smokers]<[Any freq.] & [Cigarettes]<[Any tobacco type]

Network meta-analysis with an order prior [2]

dat_smok1 <- cw$CWData(
  df = df_in1, obs = "ratio_log", obs_se = "ratio_se_log",
  alt_dorms = "alt", ref_dorms = "reference",
  covs = list("age_mid"), study_id = "nid" )

fit_smok1 <- cw$CWModel(
  cwdata = dat_smok1,    # object returned by cw$CWData()
  obs_type = "diff_log", # "diff_log" or "diff_logit" depending on bounds; log matches the obs columns here
  cov_models = list(cw$CovModel("intercept")),
  gold_dorm = "smoked_current_any_var",  # gold standard level of ref_dorms
  order_prior = list( # priors (left < right)
    c('cig_current_daily_var', 'cig_current_any_var'),
    c('cig_current_daily_var', 'smoked_current_daily_var'),
    c('cig_current_any_var', 'smoked_current_any_var'),
    c('smoked_current_daily_var', 'smoked_current_any_var') ) )

Network meta-analysis, no intercept and 1 continuous cov. [1]

NMA can include continuous covariates

Network meta-analysis, no intercept and 1 continuous cov. [2]

Effect of breastfeeding on child diarrhea, modified by WaSH

Special options in CrossWalk functions

In cw$CWData():

  • dorm_separator for when a measurement method is composed of sub-definitions; note that the default separator is a space (" ")

In cw$CWModel(): order_prior, prior_gamma_uniform and prior_gamma_gaussian

In model_object$adjust_orig_vals():

  • study_id for when you want to predict using random effects (see the sketch after this list)

In cw$CovModel(): spline_monotonicity, spline_convexity, prior_beta_uniform, prior_beta_gaussian
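
A sketch of the random-effects prediction option, reusing the session 1 objects (the choice of match_id as the grouping column is an assumption for illustration):

pred_re <- fit1$adjust_orig_vals(
  df = df_orig,
  orig_dorms = "obs_method",
  orig_vals_mean = "incidence1_mean",
  orig_vals_se = "incidence1_se",
  study_id = "match_id"  # assumed grouping column, analogous to the ids used in fitting
)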

Health Metrics Toolbox book

Key takeaways

– Crosswalking is the process of adjusting data for known biases.

– R users: To access the CrossWalk package at IHME, see https://hub.ihme.washington.edu/x/aQtzCw

– cw$CWData() formats the data, cw$CWModel() runs the model, and the fitted model's adjust_orig_vals() method applies the adjustment

– Reproducible examples are available in the Health Metrics Toolbox: https://scicomp-docs.ihme.washington.edu/msca/book/current/

– Thank you Sasha Aravkin and Peng Zheng for creating the package!

Calculating an adjustment by hand [1]

In the first example, an intercept-only model, how exactly did we go from this…

##   orig_row   obs_method incidence1_mean incidence1_se
## 1        1 Diagnostic_B        1.951142      1.396833

to this?

##   orig_row   obs_method ref_vals_mean ref_vals_sd
## 1        1 Diagnostic_B      3.590443    2.635479

Calculating an adjustment by hand [2]

Log scale

unlist(fit1$fixed_vars) # estimated coefficients
## Diagnostic_A Diagnostic_B 
##    0.0000000   -0.6098604

Exponentiated (RR scale)

unlist(lapply(fit1$fixed_vars, exp)) # exponentiated
## Diagnostic_A Diagnostic_B 
##    1.0000000    0.5434267

Calculating an adjustment by hand [3]

Adjusting the point estimate (1.951142 -> 3.590443)

Predicted adjustment

df_orig2 <- cbind(df_orig, pred_tmp) # bind adjust_orig_vals() output to the original data
(logscale_prediction <- df_orig2[1, "pred_diff_mean"])
## [1] -0.6098604
(adjusted_mean <- exp(log(1.951142) - logscale_prediction))
## [1] 3.590442

Calculating an adjustment by hand [4]

Adjusting the standard error (1.396833 -> 2.635479)

Original SE and prediction SE in log space

(logscale_orig_se <- cw$utils$linear_to_log(
  mean = array(df_orig2[1, "incidence1_mean"]), 
  sd = array(df_orig2[1, "incidence1_se"])
)[[2]])
## [1] 0.7159052
(logscale_pred_se <- df_orig2[1, "pred_diff_sd"])
## [1] 0.162093

Calculating an adjustment by hand [5]

Adjusting the standard error (1.396833 -> 2.635479)

Adjusted SE in log and linear space

(adjusted_se <- sqrt(logscale_orig_se^2 + logscale_pred_se^2 + fit1$gamma))
## [1] 0.7340261
(cw$utils$log_to_linear(mean = array(log(adjusted_mean)), sd = array(adjusted_se))[[2]])
## [1] 2.635478

Formulas for network meta-analysis

Thank you!

Reed Sorensen

  • rsoren@uw.edu

  • #friends_of_mr_brt Slack channel

MSCA office hours, Mondays and Wednesdays at 11 a.m. (#msca-office-hour)