Introduction to Crosswalking in R

GBD training, June 2021, IHME, University of Washington

Key takeaways

– Crosswalking is the process of taking systematically biased data points and estimating their unbiased values

– Peng created a Python package (github.com/ihmeuw-msca/CrossWalk), and Reed turned it into an R package: library(crosswalk, lib.loc = "/ihme/code/mscm/R/packages/") for R version 3, library(crosswalk002, lib.loc = "/ihme/code/mscm/Rv4/packages/") for R version 4

– CWData() formats meta-regression data, CWModel() runs the model and adjust_orig_vals() does the adjustment

– Tutorials, slides and more available at: https://hub.ihme.washington.edu/display/MSCA/Math+Sciences+Team

Minimal example

dat1 <- CWData(
  df = df_matched, 
  obs = "logit_diff", obs_se = "logit_diff_se", 
  alt_dorms = "dorm_alt",  ref_dorms = "dorm_ref", 
  covs = list("age"), 
  study_id = "id2" )

fit1 <- CWModel(
  cwdata = dat1, 
  obs_type = "diff_logit", 
  cov_models = list(CovModel("intercept")), 
  gold_dorm = "Diagnosed" )

preds1 <- adjust_orig_vals(
  fit_object = fit1, 
  df = df_orig, 
  orig_dorms = "obs_method", 
  orig_vals_mean = "prev", 
  orig_vals_se = "prev_se" )

Overview

How not to do a crosswalk

Preparing data for meta-regression

Basic crosswalk in R

Network meta-analysis in R

Attempt #1

\(y_i = \beta_0 + \beta_1 age_i + \beta_2 dorm_i + \epsilon_i\)

\(y_i^* = y_i - \beta_2 dorm_i \space\space\space\space\space\space dorm_i = \left\{\begin{array}{ll} 1 & self-reported \\ 0 & diagnosed \end{array} \right.\)

Attempt #2

\(logit(y_i) = \beta_0 + \beta_1 age_i + \beta_2 dorm_i + \epsilon_i\) \(y_i^* = logit^{-1}(logit(y_i) - \beta_2 dorm_i)\)

Attempt #3 (GBD method)

Create matched pairs of alternative/reference observations
Model the differences in log or logit space, e.g. \(g(y_{ij}^{alt}) - g(y_{ij}^{ref}) = \beta_0 + u_j + \epsilon_i\)
Predict how much the alternative points should be adjusted

Data prep

Preparing data for meta-regression [1]

# construct matching groups

library(dplyr)
df_orig <- df_tmp %>%
  mutate(
    age_cat = cut(age, breaks = seq(0, 100, by = 10), right = FALSE),
    id = paste0(age_cat), # could include other variables here
    orig_row = 1:nrow(.)
  )

df_orig %>%
  filter(age_cat == "[0,10)") %>%
  select(age, obs_method, prev, prev_se)

##   age    obs_method       prev   prev_se
## 1 3.0 Self-reported 0.03446539 0.1497298
## 2 3.8     Diagnosed 0.03476079 0.1025953
## 3 5.6 Self-reported 0.10096372 0.1218764
## 4 9.2 Self-reported 0.21896641 0.1291145

Preparing data for meta-regression [2]

# get matched pairs -- here's some code you can use

method_var <- "obs_method"
gold_def <- "Diagnosed"
keep_vars <- c("orig_row", "age", "prev", "prev_se")

df_matched <- do.call("rbind", lapply(unique(df_orig$id), function(i) {
  dat_i <- filter(df_orig, id == i) %>% mutate(dorm = get(method_var))
  keep_vars <- c("dorm", keep_vars)
  row_ids <- expand.grid(idx1 = 1:nrow(dat_i), idx2 = 1:nrow(dat_i))
  do.call("rbind", lapply(1:nrow(row_ids), function(j) {
    dat_j <- row_ids[j, ] 
    dat_j[, paste0(keep_vars, "_alt")] <- dat_i[dat_j$idx1, keep_vars]
    dat_j[, paste0(keep_vars, "_ref")] <- dat_i[dat_j$idx2, keep_vars]
    filter(dat_j, dorm_alt != gold_def & dorm_alt != dorm_ref)
  })) %>% mutate(id = i) %>% select(-idx1, -idx2)
}))

# if using this for NMA, need to remove duplicate indirect comparisons

Preparing data for meta-regression [3]

# to use a grouping variable as a covariate, take the average
df_matched$age <- (df_matched$age_ref + df_matched$age_alt) / 2

# let's check on the result
filter(df_matched, id == "[0,10)")

##        dorm_alt orig_row_alt age_alt   prev_alt prev_se_alt  dorm_ref
## 1 Self-reported            1     3.0 0.03446539   0.1497298 Diagnosed
## 2 Self-reported            3     5.6 0.10096372   0.1218764 Diagnosed
## 3 Self-reported            4     9.2 0.21896641   0.1291145 Diagnosed
##   orig_row_ref age_ref   prev_ref prev_se_ref     id age
## 1            2     3.8 0.03476079   0.1025953 [0,10) 3.4
## 2            2     3.8 0.03476079   0.1025953 [0,10) 4.7
## 3            2     3.8 0.03476079   0.1025953 [0,10) 6.5

`delta_transform()`

We use the delta method to convert standard errors into/out of log/logit space

library(crosswalk, lib.loc = "/ihme/code/mscm/R/packages/")

## Loading required package: reticulate

# SE is the standard deviation of the mean estimate (hence `sd =`)
delta_transform(mean = 0.2, sd = 0.1, transformation = "linear_to_logit")

##      mean_logit sd_logit
## [1,]  -1.386294    0.625

# you can pass in vectors; note that SE in logit space depends on mean
delta_transform(mean = c(0.2, 0.1), sd = c(0.1, 0.1), transformation = "linear_to_logit")

##      mean_logit sd_logit
## [1,]  -1.386294 0.625000
## [2,]  -2.197225 1.111111

# equivalent to 
# xwalk$utils$linear_to_logit(mean = array(c(0.2, 0.1)), sd = array(c(0.1, 0.1)))

`calculate_diff()`

\(Var(a + b) = Var(a) + Var(b)\)

\(\sigma_{a+b} = \sqrt{\sigma_a^2 + \sigma_b^2}\)

df_diff <- data.frame(a = 0.5, a_se = 0.1, b = 0.6, b_se = 0.1)

calculate_diff(
  df = df_diff, 
  alt_mean = "a", alt_sd = "a_se",
  ref_mean = "b", ref_sd = "b_se"
)

##   diff_mean   diff_sd
## 1      -0.1 0.1414214

Preparing data for meta-regression [6]

# using delta_transform() and calculate_diff(), get difference in log/logit space
library(crosswalk, lib.loc = "/ihme/code/mscm/R/packages/")

dat_diff <- as.data.frame(cbind(
  delta_transform(
    mean = df_matched$prev_alt, 
    sd = df_matched$prev_se_alt,
    transformation = "linear_to_logit" ),
  delta_transform(
    mean = df_matched$prev_ref, 
    sd = df_matched$prev_se_ref,
    transformation = "linear_to_logit")
))
names(dat_diff) <- c("mean_alt", "mean_se_alt", "mean_ref", "mean_se_ref")

df_matched[, c("logit_diff", "logit_diff_se")] <- calculate_diff(
  df = dat_diff, 
  alt_mean = "mean_alt", alt_sd = "mean_se_alt",
  ref_mean = "mean_ref", ref_sd = "mean_se_ref" )

Preparing data for meta-regression [7]

# let's check on the calculated differences
filter(df_matched, id == "[0,10)")

##        dorm_alt orig_row_alt age_alt   prev_alt prev_se_alt  dorm_ref
## 1 Self-reported            1     3.0 0.03446539   0.1497298 Diagnosed
## 2 Self-reported            3     5.6 0.10096372   0.1218764 Diagnosed
## 3 Self-reported            4     9.2 0.21896641   0.1291145 Diagnosed
##   orig_row_ref age_ref   prev_ref prev_se_ref     id age   logit_diff
## 1            2     3.8 0.03476079   0.1025953 [0,10) 3.4 -0.008840455
## 2            2     3.8 0.03476079   0.1025953 [0,10) 4.7  1.137323815
## 3            2     3.8 0.03476079   0.1025953 [0,10) 6.5  2.052186157
##   logit_diff_se
## 1      5.440103
## 2      3.339566
## 3      3.149579

POP QUIZ!

When would you want to choose Model 2 over Model 1?

\(g(y_{ij}^{alt}) - g(y_{ij}^{ref}) = \beta_0 + u_j + \epsilon_i\)

\(g(y_{ij}^{alt}) - g(y_{ij}^{ref}) = \beta_0 + \beta_1 age_i + u_j + \epsilon_i\)

Basic crosswalk in R

# CWData(), CovModel() and CWModel() functions
df_matched$id2 <- as.integer(as.factor(df_matched$id)) # ...housekeeping

# format data for meta-regression; pass in data.frame and variable names
dat1 <- CWData(
  df = df_matched,
  obs = "logit_diff",       # matched differences in logit space
  obs_se = "logit_diff_se", # SE of matched differences in logit space
  alt_dorms = "dorm_alt",   # var for the alternative def/method
  ref_dorms = "dorm_ref",   # var for the reference def/method
  covs = list("age"),       # list of (potential) covariate columns
  study_id = "id2"          # var for random intercepts; i.e. (1|study_id)
)

fit1 <- CWModel(
  cwdata = dat1,           # result of CWData() function call
  obs_type = "diff_logit", # must be "diff_logit" or "diff_log"
  cov_models = list( # specify covariate details
    CovModel("intercept"), CovModel("age")),
  gold_dorm = "Diagnosed"  # level of 'ref_dorms' that's the gold standard
)

pred_tmp <- adjust_orig_vals(
  fit_object = fit1, 
  df = data.frame(age = 0, obs_method = "Self-reported", prev1 = 0.1, prev1_se = 0.1), 
  orig_dorms = "obs_method", 
  orig_vals_mean = "prev1",
  orig_vals_se = "prev1_se"
)

`CovModel()` and `XSpline()`

CovModel(
  cov_name = "age", 
  spline = XSpline(
    knots = c(0, 33, 66, 100), # outer knots must encompass the data
    degree = 3L, # polynomial order (1=linear, 2=quadratic, 3=cubic)
    l_linear = TRUE, # linearity in the left tail
    r_linear = TRUE ), # linearity in the right tail
  spline_monotonicity = "increasing" # 'increasing' or 'decreasing'
  # spline_convexity = "convex" # 'convex' or 'concave'
)

# adjust_orig_vals() function

df_orig[, c("prev2", "prev2_se", "diff", "diff_se", "data_id")] <- adjust_orig_vals(
  fit_object = fit1,       # result of CWModel()
  df = df_orig,            # original data with obs to be adjusted
  orig_dorms = "obs_method", # name of column with (all) def/method levels
  orig_vals_mean = "prev",  # original mean
  orig_vals_se = "prev_se"  # standard error of original mean
)

df_orig %>%
  select(obs_method, age, prev, prev_se, prev2, prev2_se, diff, diff_se) %>%
  head(2)

##      obs_method age       prev   prev_se      prev2   prev2_se      diff
## 1 Self-reported 3.0 0.03446539 0.1497298 0.01577461 0.07013852 0.8007268
## 2     Diagnosed 3.8 0.03476079 0.1025953 0.03476079 0.10263904 0.0000000
##     diff_se
## 1 0.3943029
## 2 0.0000000

It worked!

Let’s check the calculation

##      obs_method age       prev   prev_se      prev2   prev2_se
## 1 Self-reported   3 0.03446539 0.1497298 0.01577461 0.07013852

fit1$fixed_vars # estimated coefficients

## $Diagnosed
## [1] 0 0
## 
## $`Self-reported`
## [1]  0.807331588 -0.002201601

# prediction = beta0 + beta1*age
# adj_prev_logit = prev_logit - prediction (pop quiz: why subtract?)
inv_logit(logit(0.03446539) - (0.807331588 + -0.002201601 * 3) )

## [1] 0.01577461

Propagating uncertainty

SE of an adjusted data point includes:

– \(\sigma_i\): standard error of the original data point

– \(\sigma_\hat{p}\): standard error of the predicted adjustment factor

– \(\gamma\): variance of between-group heterogeneity

\(\sigma_i^* = \sqrt{\sigma_i^2 + \sigma_\hat{p}^2 + \gamma}\)

# checking that the adjusted SE is larger than the unadjusted in logit space
delta_transform(c(0.0345, 0.0158), c(0.1497, 0.0769), "linear_to_logit")

##      mean_logit sd_logit
## [1,]  -3.331687 4.494180
## [2,]  -4.131819 4.945223

Network meta-analysis

Intercept-only model (not NMA)

A reminder of Peng’s notation…

One covariate model (not NMA)

Network meta-analysis without covariates

Multiple “intercepts” – NOTE: intercepts are okay now!

Smoking example

Smoking matched data

##    ratio_log ratio_se_log              reference                   alt
## 1 -0.9407520    0.5139974 smoked_current_any_var cig_current_daily_var
## 2 -0.6469545    0.2904979 smoked_current_any_var cig_current_daily_var
## 3 -0.2432202    0.2261205 smoked_current_any_var cig_current_daily_var
## 4 -0.2403746    0.1415213 smoked_current_any_var cig_current_daily_var
## 5 -0.6212144    6.6189897 smoked_current_any_var cig_current_daily_var
## 6 -0.2120517    0.1029494 smoked_current_any_var cig_current_daily_var

Note: \(log(alt/ref) = log(alt) - log(ref)\) = “ratio_log”

But we should really be using \(logit(alt) - logit(ref)\). Why?

# specifying the smoking model
df_smok1 <- CWData(
  df = df_in1,            # dataset for metaregression
  obs = "ratio_log",      # column name for the observation mean
  obs_se = "ratio_se_log",# column name for the observation standard error
  alt_dorms = "alt",      # variable with alternative def/method
  ref_dorms = "reference",# variable with reference def/method
  covs = list("age_mid"), # columns that can be used as covariates later
  study_id = "nid" )      # name of the column indicating group membership

fit_smok1 <- CWModel(
  cwdata = df_smok1,     # object returned by `CWData()`
  obs_type = "diff_log", # "diff_log" or "diff_logit" depending on bounds
  cov_models = list(     # specying predictors; see help(CovModel)
    CovModel("intercept")),
  gold_dorm = "smoked_current_any_var",  # gold standard level of ref_dorms
  order_prior = list( # adjustment priors (left < right)
    c('cig_current_daily_var', 'cig_current_any_var'),
    c('cig_current_daily_var', 'smoked_current_daily_var'),
    c('cig_current_any_var', 'smoked_current_any_var'),
    c('smoked_current_daily_var', 'smoked_current_any_var') ) )

Smoking results

fit_smok1$fixed_vars # yep, they're the same

## $cig_current_any_var
## [1] -0.1497231
## 
## $cig_current_daily_var
## [1] -0.1923415
## 
## $smoked_current_any_var
## [1] 0
## 
## $smoked_current_daily_var
## [1] -0.1423697

Network meta-analysis with covariates

Breastfeeding example

Breastfeeding data

##       log_rr        se    exposure alt_dorms ref_dorms
## 1 -0.5618989 0.2182322 0.020489951       any      none
## 2 -1.2377944 0.3165094 0.020489951 exclusive   partial
## 3 -0.4780358 0.2233339 0.005671335       any      none
## 4 -0.8209806 0.2702530 0.005671335       any      none
## 5 -0.8439701 0.1810399 0.006323060       any      none
## 6 -0.3364722 0.4186701 0.813969573       any      none

The health benefit of breastfeeding is higher when water quality is poor (higher exposure)

The quantity of interest is the adjustment itself, the log relative risk associated with a certain breastfeeding level compared to no breastfeeding (protective risk factor)

Breastfeeding model

df_bf1 <- CWData(
  df = df_bf,
  obs = "log_rr",
  obs_se = "se",
  alt_dorms = "alt_dorms",
  ref_dorms = "ref_dorms",
  covs = list("exposure"), # no intercept, so log_rr=0 by definition
  study_id = "nid" )

fit_bf1 <- CWModel(
  cwdata = df_bf1,
  obs_type = "diff_log",
  cov_models = list(CovModel("exposure")),
  gold_dorm = "none",
  order_prior = list(
    c('exclusive', 'predominant'), c('predominant', 'partial'),
    c('partial', 'none'), c('exclusive', 'any'),
    c('any', 'none') ),
  max_iter = 100L, inlier_pct = 1 )

Breastfeeding results [1]

fit_bf1$fixed_vars # yep, they're the same again

## $any
## [1] -1.27761
## 
## $exclusive
## [1] -2.138766
## 
## $none
## [1] 0
## 
## $partial
## [1] -0.01610846
## 
## $predominant
## [1] -1.190125

Breastfeeding results [2]

Results when cov_models = list(CovModel("intercept"), CovModel("exposure"))

## $any
## [1] -0.6184746 -0.5274239
## 
## $exclusive
## [1] -1.0712081 -0.8249359
## 
## $none
## [1] 0 0
## 
## $partial
## [1]  0.0001164524 -0.0205317414
## 
## $predominant
## [1] -1.0766546  0.1354054

Each dorm has its own intercept and slope

Key takeaways

– Crosswalking is the process of taking systematically biased data points and estimating their unbiased values

– CWData() formats meta-regression data, CWModel() runs the model and adjust_orig_vals() does the adjustment

– Tutorials, slides and more available at: https://hub.ihme.washington.edu/display/MSCA/Math+Sciences+Team

`rpubs.com/rsoren/572598`

# thank you

Key takeaways

Minimal example

Overview

Attempt #1

Attempt #2

Attempt #3 (GBD method)

Data prep

Preparing data for meta-regression [1]

Preparing data for meta-regression [2]

Preparing data for meta-regression [3]

delta_transform()

calculate_diff()

Preparing data for meta-regression [6]

Preparing data for meta-regression [7]

POP QUIZ!

Basic crosswalk in R

CovModel() and XSpline()

It worked!

Let’s check the calculation

Propagating uncertainty

Network meta-analysis

Intercept-only model (not NMA)

One covariate model (not NMA)

Network meta-analysis without covariates

Smoking example

Smoking matched data

Smoking results

Network meta-analysis with covariates

Breastfeeding example

Breastfeeding data

Breastfeeding model

Breastfeeding results [1]

Breastfeeding results [2]

Key takeaways

rpubs.com/rsoren/572598

`delta_transform()`

`calculate_diff()`

`CovModel()` and `XSpline()`

`rpubs.com/rsoren/572598`