This vignette demonstrates the core algorithm behind CORESH search engine for querying public gene expression datasets based on a user-provided gene signature (https://alserglab.wustl.edu/coresh/). CORESH ranks the datasets based on the level of coregulation of user-provided genes using a score inspired by Principal Component Analysis, which can be applied to any gene expression matrix. Currently, CORESH operates on a compendium of more than 40,000 mouse and 40,000 human gene expression datasets from the GEO database, including datasets from both microarray and RNA-seq profiling.
All of the required R dependencies can be installed by installing coresh
R package:
library(remotes)
install_github("alserglab/coresh")
The preprocessed compendium of GEO datasets can be downloaded from the Synapse project syn66227307. Notice, that to download the files you will need to register at Synapse first.
After user registration, you can install the official Python client (https://python-docs.synapse.org/en/stable/tutorials/installation/).
Install the client with pip
using the command line:
pip install --upgrade synapseclient
Then log in to the client (you can use a personal access token, which can be created at https://accounts.synapse.org/authenticated/personalaccesstokens):
synapse config
Finally, download the files from the project
synapse get -r syn66227307
If everything finished correctly, a preprocessed_chunks
folder should appear.
tree -n preprocessed_chunks | head
## preprocessed_chunks
## ├── SYNAPSE_METADATA_MANIFEST.tsv
## ├── hsa
## │ ├── SYNAPSE_METADATA_MANIFEST.tsv
## │ ├── chunk_10_full_objects.qs2
## │ ├── chunk_11_full_objects.qs2
## │ ├── chunk_12_full_objects.qs2
## │ ├── chunk_13_full_objects.qs2
## │ ├── chunk_14_full_objects.qs2
## │ ├── chunk_15_full_objects.qs2
Now we can switch to R and take a look at the data. In this vignette we will be working with the human data.
The data is split into around hundred chunks:
chunkPaths <- list.files("./preprocessed_chunks/hsa/",
pattern="full_objects.qs2",
full.names = TRUE)
print(length(chunkPaths))
## [1] 89
Each chunk consists from around 500 objects and can be loaded using qs2
packages:
library(qs2)
chunk <- qs_read(chunkPaths[[1]])
print(length(chunk))
## [1] 500
Each object corresponds to a GEO dataset.
Note that E1024
attribute contains a processed gene expression matrix: it is centered,
potentially reduced with a Principal Component Analysis, multiplied by 1024 and rounded to
the nearest integer.
str(chunk[[1]])
## List of 8
## $ rownames : int [1:10000] 9 12 14 16 18 19 20 21 22 23 ...
## $ samples : chr [1:10(1d)] "GSM15785" "GSM15786" "GSM15787" "GSM15788" ...
## $ nsamples : int 10
## $ totalVar : num 3096
## $ E1024 : int [1:10000, 1:10] 28 51 120 -268 -368 -18 196 318 114 6 ...
## $ gseId : chr "GSE1000"
## $ gplId : chr "GPL96"
## $ wordMatrix: num [1:15, 1:10] -0.158 -0.158 0.632 0.387 0.387 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:15] "culture" "terminated" "sample" "acid" ...
## .. ..$ : chr [1:10] "GSM15785" "GSM15786" "GSM15787" "GSM15788" ...
As a test query gene set we will use “HALLMARK_HYPOXIA” gene set from the MSigDB database:
library(msigdbr)
hallmarks <- msigdbr(species="human", collection = "H")
query <- hallmarks |>
dplyr::filter(gs_name == "HALLMARK_HYPOXIA") |>
dplyr::pull(ncbi_gene)
query <- as.integer(query) # gene IDs are stored as integers
str(query)
## int [1:200] 57007 133 136 205 9590 226 229 230 272 51129 ...
Let’s define a function to match an object against the query gene set, which will
rely on geseca
method from the fgsea
package:
library(fgsea)
library(data.table)
coreshMatch <- function(obj, query, calculatePvalues=FALSE) {
E <- obj$E1024/1024
queryIdxs <- na.omit(match(query, obj$rownames))
k <- length(queryIdxs)
curProfile <- colSums(E[queryIdxs, , drop=FALSE])
queryVar <- sum(curProfile**2)
if (calculatePvalues) {
gesecaPval <- fgsea:::gesecaCpp(E, queryVar, k, sampleSize=21, seed=1, eps = 1e-300)[[1]]$pval
} else {
gesecaPval <- NA
}
res <- data.table(
gse=obj$gseId,
gpl=obj$gplId,
pctVar=queryVar / k / obj$totalVar * 100,
pval=gesecaPval,
size=k
)
return(res)
}
Testing on the first dataset:
coreshMatch(chunk[[1]], query, calculatePvalues = TRUE)
## gse gpl pctVar pval size
## <char> <char> <num> <num> <int>
## 1: GSE1000 GPL96 0.3102065 4.903416e-11 165
Before running it on the whole compendium, set up the parallel back-end appropriate to your machine. The settings could be different from the ones below.
library(BiocParallel)
bpparam <- BiocParallel::MulticoreParam(8, progressbar = TRUE)
Without P-value calculation, the whole process should take 10-20 seconds, depending on your machine.
varRanking <- rbindlist(bplapply(chunkPaths, function(chunkPath) {
chunk <- qs_read(chunkPath)
chunkRanking <- rbindlist(lapply(chunk, coreshMatch, query=query))
chunkRanking
}, BPPARAM = bpparam))
varRanking <- varRanking[order(pctVar, decreasing=TRUE)]
head(varRanking)
## gse gpl pctVar pval size
## <char> <char> <num> <lgcl> <int>
## 1: GSE131379 GPL18573 7.452995 NA 147
## 2: GSE198308 GPL24676 7.033059 NA 150
## 3: GSE59449 GPL6244 6.911020 NA 141
## 4: GSE239892 GPL18573 6.878768 NA 145
## 5: GSE196274 GPL23227 6.804668 NA 144
## 6: GSE179327 GPL11154 6.496734 NA 153
Ranking by p-value usually is a bit more specific, but takes about couple of minutes to calculate.
pvalRanking <- rbindlist(bplapply(chunkPaths, function(chunkPath) {
chunk <- qs_read(chunkPath)
chunkRanking <- rbindlist(lapply(chunk, coreshMatch, query=query, calculatePvalue=TRUE))
chunkRanking
}, BPPARAM = bpparam))
pvalRanking <- pvalRanking[order(pval)]
head(pvalRanking)
## gse gpl pctVar pval size
## <char> <char> <num> <num> <int>
## 1: GSE198308 GPL24676 7.033059 8.722967e-233 150
## 2: GSE145224 GPL20301 5.942319 2.008495e-229 159
## 3: GSE153557 GPL24676 5.627913 4.722340e-229 147
## 4: GSE59449 GPL6244 6.911020 4.050704e-228 141
## 5: GSE153398 GPL11154 4.405773 2.327456e-166 144
## 6: GSE153291 GPL18573 4.320505 7.290046e-126 133
The current compendium snapshot does not provide metadata, but you can get the information from GEO:
geoUrl <- sprintf("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=%s&targ=self&form=text&view=brief", pvalRanking$gse[1])
head(readLines(geoUrl), 3)
## [1] "^SERIES = GSE198308"
## [2] "!Series_title = RNA-seq of HeLa cells stably overexpressing SRSF6-GFP grown under normoxic and hypoxic conditions"
## [3] "!Series_geo_accession = GSE198308"
sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: x86_64-pc-linux-gnu
## Running under: Debian GNU/Linux 12 (bookworm)
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.21.so; LAPACK version 3.11.0
##
## locale:
## [1] LC_CTYPE=C.utf8 LC_NUMERIC=C LC_TIME=C.utf8
## [4] LC_COLLATE=C.utf8 LC_MONETARY=C.utf8 LC_MESSAGES=C.utf8
## [7] LC_PAPER=C.utf8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.utf8 LC_IDENTIFICATION=C
##
## time zone: US/Central
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.38.0 data.table_1.16.4 fgsea_1.33.1
## [4] msigdbr_10.0.0 qs2_0.1.4 BiocStyle_2.32.1
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 generics_0.1.3 tidyr_1.3.1
## [4] lattice_0.22-6 stringi_1.8.4 hms_1.1.3
## [7] digest_0.6.36 magrittr_2.0.3 evaluate_0.24.0
## [10] grid_4.4.3 bookdown_0.40 msigdbdf_24.1.0
## [13] fastmap_1.2.0 Matrix_1.7-2 rprojroot_2.0.4
## [16] jsonlite_1.8.8 BiocManager_1.30.23 purrr_1.0.2
## [19] scales_1.3.0 codetools_0.2-19 jquerylib_0.1.4
## [22] cli_3.6.3 rlang_1.1.4 munsell_0.5.1
## [25] cowplot_1.1.3 withr_3.0.2 cachem_1.1.0
## [28] yaml_2.3.9 tools_4.4.3 parallel_4.4.3
## [31] tzdb_0.4.0 dplyr_1.1.4 colorspace_2.1-1
## [34] ggplot2_3.5.1 fastmatch_1.1-6 babelgene_22.9
## [37] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
## [40] stringr_1.5.1 stringfish_0.16.0 pkgconfig_2.0.3
## [43] RcppParallel_5.1.10 pillar_1.10.0 bslib_0.7.0
## [46] gtable_0.3.6 glue_1.8.0 Rcpp_1.0.13-1
## [49] xfun_0.45 tibble_3.2.1 tidyselect_1.2.1
## [52] rstudioapi_0.16.0 knitr_1.48 htmltools_0.5.8.1
## [55] rmarkdown_2.27 readr_2.1.5 compiler_4.4.3