Notation: \(n_i, i = 1, ..., 4\) = the number of total samples that have stage \(i\) in the training data

\(y_i, i = 1, ..., 4\) = the number of positive calls among patients in the stage \(i\) category

\(w_i, i = 1, ..., 4\) = the stage-weighted multiplier for stage \(i\); i.e. \((w_1, w_2, w_3, w_4) = (0.547, 0.075, 0.218, 0.16)\)

Per training donor, define

\(c_j, j = 1, ..., n\) = the call for each sample \(j\)

\(s_j, j = 1, ..., n\) = the stage of each sample \(j\)

Then, \(y_i = \sum_{j = 1}^n c_j \boldsymbol{1}\left(s_j = i \right)\)

Blended sensitivity can be expressed as:

\[ \sum_{i = 1}^{4} \left(\frac{w_i y_i}{n_i} \right) \]

Weighted sensitivity can be expressed as:

\[ \sum_{j = 1}^{n} \left(\frac{w_{s_j}}{\sum_{k = 1}^{n} w_{s_k}} \right) c_j \\ = \sum_{i = 1}^{4} \left(\frac{w_{i}}{\sum_{k = 1}^{n} w_{s_k}} \right) \sum_{j = 1}^{n}c_j \boldsymbol{1}\left(s_j = i \right) \\ = \sum_{i = 1}^{4} \left(\frac{w_{i} y_i}{\sum_{k = 1}^{n} w_{s_k}} \right) \]

The difference between the two estimates lies in the denominator within the sum: \(n_i\) vs \(\sum_{j = 1}^{n} w_{s_j}\).
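To make the two formulas concrete, here is a minimal base-R sketch that evaluates them on a small data set. The ten donors, their stages `s` and their calls `calls` are hypothetical values chosen for illustration; only the weights `w` come from the notation above.

```r
# Hypothetical toy data: 10 donors, stages coded 1-4 (not from the simulation)
w <- c(0.547, 0.075, 0.218, 0.16)          # stage weights w_i
s <- c(1, 1, 1, 1, 1, 1, 2, 3, 3, 4)       # stage s_j of each donor
calls <- c(1, 1, 1, 0, 0, 0, 1, 1, 1, 1)   # call c_j of each donor

n_i <- tabulate(s, nbins = 4)              # donors per stage
y_i <- tabulate(s[calls == 1], nbins = 4)  # positive calls per stage

# blended: weight each stage-wise sensitivity y_i / n_i by w_i
blended <- sum(w * y_i / n_i)

# weighted: weight each call by its stage weight, normalised over all donors
weighted <- sum(w[s] * calls) / sum(w[s])

print(c(blended = blended, weighted = weighted))
```

Here blended sensitivity is 0.7265 while weighted sensitivity is about 0.585, because Stage I (6 of the 10 donors, sensitivity 0.5) dominates the denominator \(\sum_{j} w_{s_j}\).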

We can define the ‘true’ sensitivity as what we would observe in an infinitely large validation set whose stage-wise proportions of samples equal the stage-weights \(w_i\):

\[ w_i = \lim_{n \to \infty}\frac{n_i}{n} \\ = \lim_{n \to \infty}\frac{ \sum_{j = 1}^{n} \boldsymbol{1}\left(s_j = i \right)}{n} \\ = \lim_{n \to \infty}\frac{ \sum_{j = 1}^{n} \boldsymbol{1}\left(s_j = i \right)}{\sum_{i = 1}^{4} \sum_{j = 1}^{n}\boldsymbol{1}\left(s_j = i \right)} \text{ for } i= 1,...,4 \]

The true sensitivity is then the limiting proportion of positive calls in that set: true sensitivity = \(\lim_{n \to \infty} \frac{\sum_{j = 1}^n c_j}{n}\).

Let \(p_i, i = 1, ..., 4\) = the true stage-wise sensitivities in this infinitely large validation set:

\[ p_i = \lim_{n \to \infty} \frac{\sum_{j = 1}^{n}c_j \boldsymbol{1}\left(s_j = i \right)}{\sum_{j = 1}^{n} \boldsymbol{1}\left(s_j = i \right)} \\ = \lim_{n \to \infty} \frac{y_i}{n_i} \text{ for } i= 1,...,4 \]

Then, the true validation sensitivity is equal to \(\sum_{i = 1}^4 p_i w_i\), which is exactly how we calculate blended sensitivity. To see that in detail:

\[ \text{True sensitivity }= \lim_{n \to \infty} \frac{\sum_{j = 1}^n c_j}{n} \\ = \lim_{n \to \infty} \frac{\sum_{i = 1}^{4} \sum_{j = 1}^{n}c_j \boldsymbol{1}\left(s_j = i \right)}{\sum_{i = 1}^{4} \sum_{j = 1}^{n}\boldsymbol{1}\left(s_j = i \right)} \\ = \lim_{n \to \infty} \sum_{i = 1}^{4} \left( \frac{\sum_{j = 1}^{n}c_j \boldsymbol{1}\left(s_j = i \right)}{\sum_{j = 1}^{n} \boldsymbol{1}\left(s_j = i \right)}\right) \left(\frac{ \sum_{j = 1}^{n} \boldsymbol{1}\left(s_j = i \right)}{\sum_{i = 1}^{4} \sum_{j = 1}^{n}\boldsymbol{1}\left(s_j = i \right)}\right) \\ = \sum_{i = 1}^{4} p_i \times w_i \]

It would therefore always be advantageous to use blended sensitivity in cross-validation, given the assumption that the training and test data are drawn from a common distribution: in expectation, blended sensitivity equals true validation sensitivity.

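We can also compare weighted sensitivity with blended sensitivity term by term. Write \(W = \sum_{j = 1}^{n} w_{s_j} = \sum_{i = 1}^{4} n_i w_i\). The stage \(i^{'}\) term of weighted sensitivity is less than or equal to the matching term of blended sensitivity iff:

\[ \sum_{j = 1}^{n} w_{s_j} \geq n_{i^{'}} \\ \iff \sum_{i = 1}^{4} \sum_{j = 1}^{n} w_{s_j} \boldsymbol{1}\left(s_j = i \right) \geq n_{i^{'}} \\ \iff \sum_{i = 1}^{4} n_i w_i \geq n_{i^{'}} \\ \iff \sum_{i = 1}^{4} \left(\frac{n_i}{n}\right) w_i \geq \frac{n_{i^{'}}}{n} \]

The left-hand side is a weighted average of the \(\frac{n_i}{n}\) (the \(w_i\) sum to 1), so it can never exceed the largest \(\frac{n_i}{n}\). The condition therefore holds for every stage only when all the \(n_i\) are equal, in which case it holds with equality; otherwise it fails for the most common stage. Weighted sensitivity is thus not guaranteed to be less than or equal to blended sensitivity.

What the comparison does show is that weighted sensitivity = blended sensitivity for arbitrary calls only when \(\frac{n_i}{n} = 1/4\) for all \(i = 1, ..., 4\). Otherwise, weighted sensitivity is a weighted average of the observed stage-wise sensitivities \(\frac{y_i}{n_i}\) with weights \(\frac{n_i w_i}{W}\) in place of \(w_i\), so it over-weights the stages with \(n_i > W\) and is biased for the true sensitivity given our assumptions. The direction of the bias depends on which stages are over-weighted: when the over-weighted stage has the lowest sensitivity (as for Stage I in the simulations below), weighted sensitivity underestimates the true sensitivity. Given the above notation: blended sensitivity - weighted sensitivity =

\[ \sum_{i = 1}^{4} \left(\frac{w_i y_i}{n_i} \right) \left(\frac{\sum_{j = 1}^{n} w_{s_j} - n_i}{\sum_{j = 1}^{n} w_{s_j}} \right) \]

Because blended sensitivity equals the true validation sensitivity in expectation, the negative of this difference approximates the bias of weighted sensitivity.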
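As a worked illustration of the size of this bias, consider the large-sample limit under two assumptions that are not part of the derivation above: the training stage proportions match the stage-weights (\(\frac{n_i}{n} \to w_i\)), and the stage-wise sensitivities are the values used in the simulation below, \((p_1, p_2, p_3, p_4) = (0.7, 0.78, 0.85, 0.95)\). Since \(\sum_{j = 1}^{n} w_{s_j} = \sum_{i = 1}^{4} n_i w_i\), weighted sensitivity then tends to:

\[ \sum_{i = 1}^{4} \left(\frac{n_i w_i}{\sum_{k = 1}^{4} n_k w_k} \right) \frac{y_i}{n_i} \to \sum_{i = 1}^{4} \left(\frac{w_i^2}{\sum_{k = 1}^{4} w_k^2} \right) p_i \]

\[ \sum_{k = 1}^{4} w_k^2 = 0.299209 + 0.005625 + 0.047524 + 0.0256 = 0.377958 \]

\[ \left(\frac{w_i^2}{\sum_{k = 1}^{4} w_k^2}\right)_{i = 1, ..., 4} \approx (0.792, 0.015, 0.126, 0.068) \]

\[ \sum_{i = 1}^{4} \left(\frac{w_i^2}{\sum_{k = 1}^{4} w_k^2} \right) p_i = \frac{0.278549}{0.377958} \approx 0.737 \text{ vs } \sum_{i = 1}^{4} w_i p_i = 0.7787 \]

Under these assumptions Stage I receives an effective weight of about 0.79 instead of 0.547, which pulls the estimate roughly 0.04 below the true sensitivity.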

We can simulate small-sample estimates of blended and weighted sensitivity and compare them with the true sensitivity as follows:

Simulate training donor stage assignments and calls according to these distributions, where \(p_i, i = 1, ..., 4\) are the true sensitivities per stage: \[ \text{stage}_j \sim \text{multinomial}(1; w_1, ..., w_4) \\ \text{call}_j \sim \text{bernoulli}(p_{\text{stage}_j}) \\ \text{for } j = 1,...,n_{training} \]

set.seed(777)
# p = true stage-wise sensitivities in validation set
simulate_calls <- function(n = 1000,
                           p = c(.7, .78, .85, .95),
                           w = c(0.547, 0.075, 0.218, 0.16),
                           seed = 777) {
  set.seed(seed)
  stage_sim <- apply(stats::rmultinom(n = n, size = 1, prob = w), 2, which.max)
  call_sim <- sapply(seq_along(stage_sim), \(i) stats::rbinom(n = 1, size = 1, prob = p[stage_sim[i]]))

  stage_names <- c("Stage I", "Stage II", "Stage III", "Stage IV")
  sim_data <- data.frame(stage = stage_names[stage_sim], calls = call_sim)
  return(sim_data)
}

# testing function
call_sim <- simulate_calls()

#check that the proportions of simulated stages in the training set are close to w:
table(call_sim$stage)/length(call_sim$stage)
## 
##   Stage I  Stage II Stage III  Stage IV 
##     0.551     0.065     0.241     0.143
#check that the simulated stage-wise sensitivities are close to p: 
call_sim |>
    dplyr::group_by(stage) |>
    dplyr::summarise(sensitivity = mean(calls))
## # A tibble: 4 × 2
##   stage     sensitivity
##   <chr>           <dbl>
## 1 Stage I         0.706
## 2 Stage II        0.831
## 3 Stage III       0.859
## 4 Stage IV        0.965

Function to calculate weighted sensitivity:

calculate_weighted_sensitivity <- function(df,
                                           w = c(0.547, 0.075, 0.218, 0.16)) {
  # per-stage weights, normalised to sum to 1
  weighted_df <-
    data.frame(
      stage = c("Stage I", "Stage II", "Stage III", "Stage IV"),
      weights = w
    ) |>
    dplyr::mutate(weights = weights / sum(weights))

  # weighted mean of the calls in df, each sample weighted by its stage weight
  df |>
    dplyr::left_join(weighted_df, by = "stage") |>
    dplyr::summarise(sensitivity = stats::weighted.mean(x = calls, w = weights)) |>
    dplyr::pull(sensitivity)
}

# testing function
calculate_weighted_sensitivity(call_sim)
## [1] 0.7441615

Function to calculate blended sensitivity:

calculate_blended_sensitivity <- function(df,
                                          w = c(0.547, 0.075, 0.218, 0.16)) {
  df |>
    dplyr::group_by(stage) |>
    dplyr::summarise(sensitivity = mean(calls)) |>
    dplyr::left_join(
      data.frame(
        stage = c("Stage I", "Stage II", "Stage III", "Stage IV"),
        weights = w
      ),
      by = "stage"
    ) |>
    dplyr::mutate(w_sensitivity = sensitivity * weights) |>
    dplyr::summarise(
      sensitivity = sum(w_sensitivity)
    ) |>
    dplyr::pull(sensitivity)
}

# testing function
calculate_blended_sensitivity(call_sim)
## [1] 0.7901341

Calculate true sensitivity:

# stage-wise sensitivity:
p <- c(.7, .78, .85, .95)

# stage-weights
w <- c(0.547, 0.075, 0.218, 0.16)

true_sens <- sum(p * w)
## [1] "true sensitivity: 0.7787"
## [1] "blended sensitivity - true sensitivity: 0.0114"
## [1] "weighted sensitivity - true sensitivity: -0.0345"

Function to repeat the simulation over many replicates:

repeat_sim <- function(reps = 1000,
                       n = 1000,
                       p = c(.7, .78, .85, .95),
                       w = c(0.547, 0.075, 0.218, 0.16),
                       w_train = c(0.547, 0.075, 0.218, 0.16)) {
  furrr::future_map(1:reps, \(i){
    call_sim <- simulate_calls(
      n = n,
      p = p,
      w = w_train,
      seed = i
    )
    weighted_sens <- calculate_weighted_sensitivity(call_sim, w = w)
    blended_sens <- calculate_blended_sensitivity(call_sim, w = w)
    return(data.frame(weighted_sens = weighted_sens, blended_sens = blended_sens))
  },
  .options = furrr::furrr_options(seed = TRUE)
  ) |>
    dplyr::bind_rows() |>
    dplyr::mutate(true_sens = sum(w * p))
}

Evaluate the simulation with different numbers of training samples:

sim_n10e3 <- repeat_sim(n = 1000) |> dplyr::mutate(n = 1000)
sim_n10e4 <- repeat_sim(n = 10000) |> dplyr::mutate(n = 10000)
sim_n10e5 <- repeat_sim(n = 50000) |> dplyr::mutate(n = 50000)

sim_concat <- dplyr::bind_rows(sim_n10e3, sim_n10e4, sim_n10e5) |>
  dplyr::mutate(n = paste0("n = ", n))

Plotting function:

plotting_fun <- function(sim_data) {
  # long format: one row per replicate and estimator
  reshape2::melt(sim_data |> dplyr::select(-true_sens), id.vars = "n") |>
    ggplot2::ggplot(ggplot2::aes(x = variable, y = value)) +
    ggplot2::geom_boxplot() +
    ggplot2::facet_wrap(~n) +
    # dashed red line marks the true sensitivity
    ggplot2::geom_hline(
      yintercept = sim_data$true_sens[1],
      linetype = "dashed", colour = "red"
    ) +
    ggplot2::theme_minimal() +
    ggplot2::xlab("") +
    ggplot2::ylab("Estimated sensitivity")
}

plotting_fun(sim_concat)
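As a numeric companion to the box plots, the following is a minimal base-R sketch of the same comparison that reports the average bias of each estimator. It is not the `repeat_sim` pipeline above: the helper `one_rep`, the seed, and the choice of 500 replicates at n = 1000 are assumptions made to keep the example self-contained and free of package dependencies.

```r
# Self-contained sketch; one_rep is a hypothetical helper, not part of the analysis code
set.seed(1)
w <- c(0.547, 0.075, 0.218, 0.16)  # stage weights, also used as training proportions
p <- c(.7, .78, .85, .95)          # true stage-wise sensitivities
true_sens <- sum(w * p)

one_rep <- function(n) {
  s <- sample(1:4, size = n, replace = TRUE, prob = w)  # stage of each donor
  calls <- stats::rbinom(n, size = 1, prob = p[s])      # call of each donor
  n_i <- tabulate(s, nbins = 4)                         # donors per stage
  y_i <- tabulate(s[calls == 1], nbins = 4)             # positive calls per stage
  c(
    blended = sum(w * y_i / n_i),
    weighted = sum(w[s] * calls) / sum(w[s])
  )
}

# one row per replicate, one column per estimator
est <- t(replicate(500, one_rep(n = 1000)))

# average estimate minus the true sensitivity
bias <- colMeans(est) - true_sens
print(round(bias, 3))
```

With training proportions equal to the stage-weights, the blended bias should sit close to zero while the weighted bias should be clearly negative, in line with the dashed reference line in the plots.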

Evaluate results with higher true stage-wise sensitivities and with stage-wise proportions in the training set that do not equal the validation stage-wise proportions (which are used as the stage-weights):

new_p <- c(.8, .86, .91, .98)
w_train <- c(.4, .2, .15, .25)
sim_n10e3_new <- repeat_sim(n = 1000, p = new_p, w_train  = w_train) |> dplyr::mutate(n = 1000)
sim_n10e4_new <- repeat_sim(n = 10000, p = new_p, w_train  = w_train) |> dplyr::mutate(n = 10000)
sim_n10e5_new <- repeat_sim(n = 50000, p = new_p, w_train = w_train ) |> dplyr::mutate(n = 50000)

sim_concat_new <- dplyr::bind_rows(sim_n10e3_new, sim_n10e4_new, sim_n10e5_new) |>
  dplyr::mutate(n = paste0("n = ", n))

plotting_fun(sim_concat_new)

The two estimates are equivalent only when the stage-wise proportions in the training set are all equal (i.e. every entry of w_train equals 1/4, the point at which the norm of w_train is smallest). If the validation set has those same equal stage-wise proportions, the same result holds, but this is not necessary for equivalence.

new_p <- c(.8, .86, .91, .98)
w_train <- c(.25, .25, .25, .25)
sim_n10e3_new <- repeat_sim(n = 1000, p = new_p, w_train  = w_train) |> dplyr::mutate(n = 1000)
sim_n10e4_new <- repeat_sim(n = 10000, p = new_p, w_train  = w_train) |> dplyr::mutate(n = 10000)
sim_n10e5_new <- repeat_sim(n = 50000, p = new_p, w_train = w_train ) |> dplyr::mutate(n = 50000)

sim_concat_new <- dplyr::bind_rows(sim_n10e3_new, sim_n10e4_new, sim_n10e5_new) |>
  dplyr::mutate(n = paste0("n = ", n))

plotting_fun(sim_concat_new)
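The equivalence under equal training proportions can also be checked algebraically. Assuming \(n_i = \frac{n}{4}\) exactly for every stage and \(\sum_{i = 1}^{4} w_i = 1\) (which holds for the stage-weights used here), the denominator of weighted sensitivity becomes:

\[ \sum_{j = 1}^{n} w_{s_j} = \sum_{i = 1}^{4} n_i w_i = \frac{n}{4} \sum_{i = 1}^{4} w_i = \frac{n}{4} = n_i \]

\[ \therefore \sum_{i = 1}^{4} \left(\frac{w_{i} y_i}{\sum_{j = 1}^{n} w_{s_j}} \right) = \sum_{i = 1}^{4} \left(\frac{w_i y_i}{n_i} \right) \]

In a simulated training set the \(n_i\) are only approximately \(\frac{n}{4}\), so the two estimates are close rather than identical in any single replicate.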

However, if only the validation set stage-wise proportions (used as stage-weights) are all equal to .25 while the training proportions remain unequal, blended sensitivity still estimates true sensitivity accurately whereas weighted sensitivity does not.

new_p <- c(.8, .86, .91, .98)
new_w <- c(.25, .25, .25, .25)
sim_n10e3_equal <- repeat_sim(n = 1000, p = new_p, w = new_w) |> dplyr::mutate(n = 1000)
sim_n10e4_equal <- repeat_sim(n = 10000, p = new_p, w = new_w) |> dplyr::mutate(n = 10000)
sim_n10e5_equal <- repeat_sim(n = 50000, p = new_p, w = new_w) |> dplyr::mutate(n = 50000)

sim_concat_equal <- dplyr::bind_rows(sim_n10e3_equal, sim_n10e4_equal, sim_n10e5_equal) |>
  dplyr::mutate(n = paste0("n = ", n))

plotting_fun(sim_concat_equal)
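The reason for this last result can be worked out directly. With equal stage-weights \(w_i = \frac{1}{4}\), the weights cancel and weighted sensitivity reduces to the raw proportion of positive calls:

\[ \sum_{j = 1}^{n} \left(\frac{\frac{1}{4}}{\sum_{k = 1}^{n} \frac{1}{4}} \right) c_j = \frac{\sum_{j = 1}^{n} c_j}{n} \]

Assuming the training proportions are the default \((0.547, 0.075, 0.218, 0.16)\) used in this run, that proportion tends to the training-mix average of the stage-wise sensitivities rather than the equally weighted one:

\[ 0.547 \times 0.8 + 0.075 \times 0.86 + 0.218 \times 0.91 + 0.16 \times 0.98 = 0.85728 \]

\[ \text{True sensitivity } = \frac{0.8 + 0.86 + 0.91 + 0.98}{4} = 0.8875 \]

So in large samples weighted sensitivity sits about 0.03 below the true sensitivity here, while blended sensitivity still weights each \(\frac{y_i}{n_i}\) by \(\frac{1}{4}\).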