Marvin Lim significance of overlaps

library(png)
img <- readPNG("/data/projects/marvin_lim/download.png")
grid::grid.raster(img)

Set	n
Background Genes	26365
Lim	4396
Beltran	70
Cheng	310

To test if the 174 NEPCA signature could be observed by chance in Lim dataset, treat Cheng et al. & Beltran et al. as one set and perform hypergeometric test. The pvalue indiciates that the result is unlikely to happen by chance.

phyper(174 - 1, 4396, 26365 - 4396, 70 + 310, lower.tail = FALSE)

## [1] 1.293393e-40

To test if the 10 gene signature could be observed by chance, simulations were performed. Firstly, a background set of gene IDs were generated (26365 in total). Next, gene sets were derived for Lim (4396), Cheng (310) and Beltran (70) where each gene set had 10 gene IDs that were common to all sets, recapitulating the venn diagram given at the top of the document.

For 100,000 iterations, we randomly sampled (without replacement) the background gene list to produce 3 gene sets of equal length to Lim, Cheng and Beltran. Upon each iteration, the number of overlapping genes between sets was computed to simulate a null distribution. This distribution is then plotted against the observed value (10) to assess if it rejects the null hypothesis. The results of the plot would indicate a very small pvalue.

The p-values for the simulation is the number of simulated values that were greater than or equal to the observed value (plus one), over the number of iterations (plus one).

We can see that the 10 gene signature is unlikely to occur by chance.

library(purrr)
library(tidyverse)

lim_total <- 4396
beltran_total <- 70
cheng_total <- 310

intersection_n <- 10

all_genes <- sprintf("ENSG%08d", seq_len(26365))
common_genes <- sample(all_genes, intersection_n)

lim <- list(lim=c(common_genes, sample(all_genes, lim_total - intersection_n)))
beltran <- list(beltran=c(common_genes, sample(all_genes, beltran_total - intersection_n)))
cheng <- list(cheng=c(common_genes, sample(all_genes, cheng_total - intersection_n)))

sets <- c(lim, beltran, cheng)

observed <- length(reduce(sets, intersect))

iterations <- 100000

simulated <- map_dbl(seq_len(iterations), function(x) {
   sim <- map(lengths(sets), ~sample(all_genes, .x))
   sim <- length(reduce(sim, intersect))
   return(sim)
})

ggplot() +
  geom_histogram(mapping=aes(x=simulated), binwidth=1, fill="dodgerblue") +
  geom_vline(xintercept=observed, lty=2, color="grey", size=1) +
  theme_bw() +
  theme(text=element_text(size=12)) +
  xlab("Overlap") +
  ylab("Count")

pval <- (sum(simulated >= observed) + 1) / (iterations + 1)

print(pval)

## [1] 9.9999e-06