Deep Patel

Contents

Report

1. Introduction

This contains the logics and the reasoning used for benchmarking single-cell data integration.

1.1 What do we mean by single-cell data integration?

By integrating the single-cell data, we mean merging two different single-cell data based on their transcriptomics data and removing the batch effects between them. The batch effect could be in any form such as: experimental procedure, species, strains etc..

1.2 What do we mean by benchmarking?

Benchmarking means scoring the integration technique. Scoring in the terms of:

  • batch mixing
  • cell-type mixing
1.2.1 Batch mixing (Example)

Consider: We are integrating Zebrafish single-cell RNA-seq data with Medaka single-cell RNA-seq data. After integrating both of them by using monocle3, we want to see for a particular cell-type (such as Notochord) how well are they mixing.

1.2.2 Cell-type mixing (Example)

Consider: We are integrating Zebrafish single-cell RNA-seq data with Medaka single-cell RNA-seq data. After integrating both of them by using monocle3, we want to see whether there is mixing of different cell-types. Such as: For CNS cell-type, we can see that there is mixing of Retina with some of the part of CNS clusters.

2. Scores

We have use three scores for benchmarking:

  • Adjusted Rand Index (ARI)
  • batch Local Simpson Index (bLISI)
  • Distributions of Batches (DB)

2.1 Adjusted Rand Index (ARI)

UMAP Consider the above as an UMAP for species 1 with the cell-types annotation give in the right hand side of the image. There are just two cell-type: white and red. UMAP Consider the above as an UMAP for species 2 with the cell-types annotation give in the right hand side of the image. There are just two cell-type: white and red.

Lets integrate species1 and species2 using monocle3.

In how many ways they can be integrated?

  • In Case-1 : Red cell-type from species are together, which means they are somehow-similar but the White cell-type are not mixing meaning they are different. Example: (If Dorsal fin is the cell-type) Dorsal fin of Zebrafish is different from Medaka thus they are seperated but Retina is almost similar between Zebrafish and Medaka.

  • In Case-2: Here, White cell-type is mixing but Red cell-type isn’t (Just the opposite case of Case-1)

  • In Case-3: Red and White both cell-types are mixing with each other from both species (i.e. species 1 and species 2)

  • In Case-4: Both the cell-types (Red and white) from both the species are still separted even after integrating which means even though the name of the cell-types are same but transcriptomically they are different!

Big Question: How to quantify such cases?

Answer is ARI 🔥

Consider: We have 5 cells (2 from species-1 and 3 from species-2) from White cell-type (which I labelled as 1) and Red cell-type (which I labelled as 2).

Ground Truth
ID Type
C1 1
C2 1
C3 1
C4 1
C5 1
After Integration
ID Type
C1 1
C2 2
C3 1
C4 1
C5 2

In the left table labelled as Ground Truth gives us the detail of the annotation of cells from both species before integration and right table labelled as After Integration which describes the annotation of the cells after integration.

Now we will select all the possible pairs of cells i.e. (C1, C2), (C2, C3), … (C4, C5) : Total 10 pairs and create a truth table as shown in the table below:

Pairs Before Integration After Integration
C1 C2 1 0
C1 C3 1 1
C1 C4 1 1
C1 C5 1 0
C2 C3 1 0
C2 C4 1 0
C2 C5 1 1
C3 C4 1 1
C3 C5 1 0
C4 C5 1 0

Description of the above truth table: 1 means that they belong to same cell-type group and 0 means the otherwise.

Now, we will select all those pairs which were in the same cell-type group before and after integration and calculation the ratio of such pairs by total pairs.

For instance in this case it’s 4/10.

We saw thee score for cell-type mixing but what about batch mixing (or species mixing) ??

2.2 batch Local Simpson Index (bLISI)

For batch mixing (species mixing), we have a score called as bLISI score.

Consider again the two species UMAP that we had before. After integration they looks like the figure(a) below.

Now we will select a cell (mark as red in color, figure(b)) and then we will select k=6 neighbours of the cell (as shown in figure(c)).

Thereafter we will calculate the ratio of the cells from species-1 and species2 in the neighbours. For example in the given case, the ratio of the cells from species-1 is 4/6 and ratio of the the cells from species-2 is 2/6 .

Let \(p_1=\frac{4}{6}=\frac{2}{3}\) and \(p_2=\frac{2}{6}=\frac{1}{3}\). The bLISI score is

\[ \mathrm{bLISI} = \frac{1}{p_1^2 + p_2^2} = \frac{1}{\left(\tfrac{2}{3}\right)^2 + \left(\tfrac{1}{3}\right)^2} = \frac{1}{\tfrac{4}{9} + \tfrac{1}{9}} = \frac{1}{\tfrac{5}{9}} = \frac{9}{5} = 1.8 . \]

More generally, with \(S\) species/batches and \(p_s\) the fraction of the \(k\)-NN from species \(s\),

\[ \mathrm{bLISI}(x) = \left( \sum_{s=1}^{S} p_s(x)^2 \right)^{-1}, \qquad \sum_{s=1}^{S} p_s(x)=1. \]

And finally after getting the score out, we will normalize the score from 0 to 1, by subtracting 1 from the total score.

Logic for Normalization: \[ \mathrm{bLISI}_{\mathrm{norm}} = \frac{\mathrm{bLISI} - 1}{S - 1}. \]

Some thinking 🧠

Consider the following cases in the image, and get the intution behind the scores.

The score does give some intution about the batch mixing in the integrated dataset but something is kinda missing!

The score doesn’t exactly tell us about what percentage of the cells in species-1 are mixing with cells from other speices. Also what percentage of the cells are forming clusters with in a cell-type

2.3 Distribution of Batches (DB)

Deutche Bahn (DB score) is the answer to the limitations of bLISI score!!

What’s this?

Again consider the same example and shown in the following image.

In bLISI, we calculated the simpson index but here we will do something different!

So, we will allocate +1 value to all the cells from species-1 and -1 value to all the cells from species-2. we will calculate the total value by adding up the values of the cells from the neighbour of the red cell (as shown in the figure) and calculating the average! We will repeat the process for all the cells and get the distribution curves to understand the situation.

The cells which have score of zero means that there are equal number of neigbours from both the species. But if the score is around +1 then the neigbours are from speices-1 and if it’s -1 then the neigbours are from species-2. So the score for species-1 will be the ratio of the cells (of species-1) which have score of 0 to the total number of the cells from species-1. Similarly for species-2 it will be the ratio of the cells (of species-2) which have score of 0 to the total number of the cells from species-2

Lets look at the example

  • Case 1

In this case, we only have cells from species-1, so scores from all the cells will be 1, thus a single distribution bump at +1

  • Case 2 In this case, we only have cells from species-2, so scores from all the cells will be -1, thus a single distribution bump at -1

  • Case 3 In this case, we have cells from both the species, so we go two bumps around +1 and -1. But there is a small bump at 0, which means there are few cells which have different neigbours which means they are mixed with each other. So final DB score for species-1 will be the number of the cells in the black bump to the total number of cells in species-1 and similary we can get DB score for species-2

3 Conclusion

We saw the scores for cell-type mixing and batch-mixing which basically cover every aspect of the integration and can be used to comprehend the situation about the integration!!

Code

compute_ari (example – shown only, not executed)

compute_ari <- function(
  cds1, cds2, combined_cds,
  cell_type_name,
  clusters_of_interest,
  annotation_col,          # e.g. "major_group"
  cluster_col = "cluster"  # e.g. "cluster" or "integrated_snn_res.0.5"
) {
  # ---- checks ----
  stopifnot(is.character(annotation_col), length(annotation_col) == 1)
  stopifnot(is.character(cluster_col), length(cluster_col) == 1)

  if (!(annotation_col %in% colnames(SummarizedExperiment::colData(cds1)))) {
    stop("'", annotation_col, "' not found in colData(cds1).")
  }
  if (!(annotation_col %in% colnames(SummarizedExperiment::colData(cds2)))) {
    stop("'", annotation_col, "' not found in colData(cds2).")
  }
  if (!(cluster_col %in% colnames(SummarizedExperiment::colData(combined_cds)))) {
    stop("'", cluster_col, "' not found in colData(combined_cds).")
  }

  # ---- ground truth (binary) for cds1 & cds2 ----
  df1 <- data.frame(
    cell = colnames(cds1),
    cell_type = ifelse(
      SummarizedExperiment::colData(cds1)[[annotation_col]] %in% cell_type_name,
      cell_type_name,
      paste0("non-", cell_type_name)
    ),
    stringsAsFactors = FALSE
  )

  df2 <- data.frame(
    cell = colnames(cds2),
    cell_type = ifelse(
      SummarizedExperiment::colData(cds2)[[annotation_col]] %in% cell_type_name,
      cell_type_name,
      paste0("non-", cell_type_name)
    ),
    stringsAsFactors = FALSE
  )

  ground_truth_df <- rbind(df1, df2)

  # ---- clusters from combined_cds ----
  clusters_df <- data.frame(
    cell = colnames(combined_cds),
    cluster = SummarizedExperiment::colData(combined_cds)[[cluster_col]],
    stringsAsFactors = FALSE
  )

  # ---- merge & label ----
  merged_df <- dplyr::inner_join(ground_truth_df, clusters_df, by = "cell")

  merged_df$cluster_label <- ifelse(
    merged_df$cluster %in% clusters_of_interest,
    cell_type_name,
    paste0("non-", cell_type_name)
  )

  merged_df <- dplyr::filter(merged_df, !is.na(cell_type), !is.na(cluster_label))

  # ---- ARI ----
  ari_score <- NA_real_
  if (nrow(merged_df) > 0 &&
      dplyr::n_distinct(merged_df$cell_type) > 1 &&
      dplyr::n_distinct(merged_df$cluster_label) > 1) {
    ari_score <- ARI(merged_df$cell_type, merged_df$cluster_label)
  } else {
    message("Cannot compute ARI: not enough classes or data.")
  }

  return(ari_score)
}

compute_ari(
  cds1 = ref_cds,
  cds2 = hni_cds,
  combined_cds = combined_cds,
  cell_type_name = "notochord",
  clusters_of_interest = group_clusters,
  annotation_col = "major_group"
)

Output (static):

0.926944958578232

compute_blisi_for_celltype (shown only, not executed)

compute_blisi_for_celltype <- function(
  combined_cds,
  cell_type_name,
  metadata_col,              # pass "major_group" or whatever
  batch_col = "dataset",
  embedding_key = "UMAP",    # flexible choice of embedding
  batch_number = 2
) {
  # -------------------------
  # Extract metadata
  # -------------------------
  meta <- as.data.frame(colData(combined_cds))
  meta$cell <- rownames(meta)
  
  # Create binary cell type label
  meta$cell_type_binary <- ifelse(
    meta[[metadata_col]] == cell_type_name,
    cell_type_name,
    paste0("non-", cell_type_name)
  )
  
  # Keep only cells of interest
  cells_of_interest <- meta %>%
    filter(cell_type_binary == cell_type_name) %>%
    pull(cell)
  
  if (length(cells_of_interest) == 0) {
    warning("No cells found for cell type: ", cell_type_name)
    return(NA)
  }
  
  # -------------------------
  # Prepare embedding coordinates
  # -------------------------
  embedding <- reducedDims(combined_cds)[[embedding_key]]
  rownames(embedding) <- colnames(combined_cds)
  
  embedding_subset <- embedding[cells_of_interest, , drop = FALSE]
  
  # -------------------------
  # Prepare metadata for LISI
  # -------------------------
  metadata_subset <- meta %>%
    filter(cell %in% cells_of_interest) %>%
    select(cell, batch = all_of(batch_col))
  
  # Check for missing batches
  if (any(is.na(metadata_subset$batch))) {
    warning("Missing batch labels detected. Removing NAs.")
    metadata_subset <- metadata_subset %>% filter(!is.na(batch))
    embedding_subset <- embedding_subset[metadata_subset$cell, , drop = FALSE]
  }
  
  # -------------------------
  # Compute LISI
  # -------------------------
  lisi_scores <- compute_lisi(
    embedding_subset,
    metadata_subset,
    label_colnames = "batch"
  )
  
  # Compute mean bLISI (normalized)
  blisi_mean <- mean(lisi_scores$batch, na.rm = TRUE)
  
  return((blisi_mean - 1) / (batch_number - 1))
}

compute_blisi_for_celltype(
  combined_cds = combined_cds,
  cell_type_name = "notochord",
  metadata_col = "major_group",   # <-- same idea as before
  batch_col = "dataset",
  embedding_key = "UMAP",
  batch_number = 2
)

Output (static):

0.225712024871021

calculate_db_score (shown only, not executed)

calculate_db_score <- function(
  cds,
  group_value,                          # e.g. "CNS"
  group_col      = "deep_annotation",   # column to filter groups
  dataset_col    = "dataset",           # column holding dataset/source labels
  positive_val   = "ref",               # value in dataset_col treated as +1
  neutral_val    = "all",               # value used in the X-count denominator
  k              = 2,
  reduction_method = "UMAP",
  neighbor_graph   = "knn",
  score_name       = "neighbor_avg_ref_score"
) {
  # ---- checks ----
  meta <- SummarizedExperiment::colData(cds)
  if (!(group_col %in% colnames(meta))) {
    stop("'", group_col, "' not found in colData(cds).")
  }
  if (!(dataset_col %in% colnames(meta))) {
    stop("'", dataset_col, "' not found in colData(cds).")
  }

  # ---- Step 1: Filter cells by group ----
  keep <- which(!is.na(meta[[group_col]]) & meta[[group_col]] == group_value)
  if (length(keep) == 0) {
    warning("No cells found for ", group_col, " == '", group_value, "'.")
    return(c(db_score = NA_real_, X = NA_real_, Y = NA_real_))
  }
  cds_sub <- cds[, keep, drop = FALSE]

  # ---- Step 2: Spatial weights using k-NN ----
  lw <- monocle3:::calculateLW(
    cds              = cds_sub,
    k                = k,
    verbose          = FALSE,
    neighbor_graph   = neighbor_graph,
    reduction_method = reduction_method,
    list(method = "nn2")
  )

  # ---- Step 3: Neighbor lists & weights (by cell name) ----
  cell_names <- attr(lw, "region.id")
  neighbors  <- lw$neighbours
  weights    <- lw$weights
  names(neighbors) <- names(weights) <- cell_names

  # ---- Step 4: Dataset membership vector (+1 for positive_val, -1 otherwise) ----
  ds_vec <- as.character(SummarizedExperiment::colData(cds_sub)[[dataset_col]])
  x <- ifelse(ds_vec == positive_val, 1, -1)
  names(x) <- colnames(cds_sub)

  # ---- Step 5: Weighted average of neighbors ----
  avg_neighbor_x <- numeric(length(x))
  names(avg_neighbor_x) <- names(x)

  for (cell in names(x)) {
    nbs <- neighbors[[cell]]
    if (is.null(nbs) || length(nbs) == 0) {
      avg_neighbor_x[cell] <- NA_real_
      next
    }
    # neighbors are given as indices into cell_names; convert to names
    nb_names <- cell_names[nbs]
    # intersect to be safe
    nb_names <- intersect(nb_names, names(x))
    if (length(nb_names) == 0) {
      avg_neighbor_x[cell] <- NA_real_
      next
    }
    w <- weights[[cell]]
    # match weights to kept neighbors just in case
    if (length(w) != length(nbs)) {
      w <- w[seq_along(nb_names)]
    }
    avg_neighbor_x[cell] <- sum(w * x[nb_names]) / sum(w)
  }

  # ---- Step 6: Store score in colData (subset object) ----
  SummarizedExperiment::colData(cds_sub)[[score_name]] <- avg_neighbor_x[colnames(cds_sub)]

  # ---- Step 7: Counts for DB score (using provided dataset values) ----
  df <- data.frame(
    cell  = colnames(cds_sub),
    score = SummarizedExperiment::colData(cds_sub)[[score_name]],
    ds    = SummarizedExperiment::colData(cds_sub)[[dataset_col]],
    stringsAsFactors = FALSE
  )

  # X: among neutral_val dataset; Y: among positive_val dataset
  X_count <- sum(df$ds == neutral_val, na.rm = TRUE)
  Y_count <- sum(df$ds == positive_val, na.rm = TRUE)
  x_count <- sum(df$score == 0 & df$ds == neutral_val, na.rm = TRUE)
  y_count <- sum(df$score == 0 & df$ds == positive_val, na.rm = TRUE)

  # avoid divide-by-zero
  X_part <- if (X_count > 0) x_count / X_count else NA_real_
  Y_part <- if (Y_count > 0) y_count / Y_count else NA_real_

  db_score <- X_part + Y_part

  # return components
  return(c(db_score = db_score, X = X_part, Y = Y_part))
}

calculate_db_score(
  cds          = combined_cds,
  group_value  = "notochord",
  group_col    = "major_group",  # where "CNS" lives
  dataset_col  = "dataset",          # e.g. "ref"/"all"/...
  positive_val = "ref",              # +1 group
  neutral_val  = "hni",              # used for X-part
  k = 2,
  reduction_method = "UMAP",
  neighbor_graph   = "knn",
  score_name       = "neighbor_avg_ref_score"
)

Output (static):

db_score 0.421664267488771
X        0.308270676691729
Y        0.113393590797042