1 Overview

SCRAM is publicly available from https://github.com/akdess/scram

This vignette shows the basic steps for running SCRAM.

2 Installation

Install the latest version from GitHub (requires the devtools package):

if (!require("devtools")) {
  install.packages("devtools")
}
devtools::install_github("akdess/scram", dependencies = TRUE, build_vignettes = FALSE)

3 Input data

The input to SCRAM consists of a Seurat R object that contains the raw expression matrix and the cluster assignments, together with the cluster markers. Cluster markers can be obtained by running the Seurat function FindAllMarkers.

The example dataset used in this tutorial can be downloaded from [here]. It contains two files:

example_seurat_object.rda: Seurat object with clustering information
example_seurat_markers.rda: Seurat cluster markers

If the markers file is missing, it can be generated by running the following code:

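library(Seurat)

# Load the example Seurat object (expected to provide `seuratObj`)
load("example_seurat_object.rda")
# Notify the user if clustering has not been run yet
if (!"seurat_clusters" %in% colnames(seuratObj@meta.data)) message("Seurat clusters are missing; please run the FindClusters function in the Seurat package")
# Find positive markers for every cluster and keep the significant ones
combined.markers <- FindAllMarkers(seuratObj, only.pos = TRUE, min.pct = 0.1, logfc.threshold = 0.5)
combined.markers <- combined.markers[combined.markers$p_val_adj < 0.05, ]
save(combined.markers, file = "example_seurat_markers.rda")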

4 Running SCRAM

4.1 Annotating Tumor Cells

Because tumor cells exhibit a wide range of transcriptional states, we employ redundant and stringent approaches to annotate tumor cells using three modular components: (1) marker-expression modeling, (2) genotyping of CNVs on all cells, and (3) RNA-inferred mutational profiling of known glioma mutations (e.g., IDH1, EGFR).

4.1.1 Tumor Marker Expression Model

Because tumor markers are also expressed, to varying degrees, in non-tumor cells, we used previously published datasets of tumor and non-tumor cells to establish a marker expression-based tumor classification model (i.e., thresholding requirements for a “high expression” annotation) for the tumor markers PDGFRA, EGFR, CDK4, IGFBP2, IGFBP5 and SOX2. For each tumor marker gene, an independent classifier model is built using: (1) Allen Brain mouse and human scRNA-seq data, the largest compendium of healthy brain data, as a training set for host cells; and (2) a compendium of publicly available brain-tumor scRNA-seq datasets as a training set for tumor cells. Finally, we model the expression as a mixture of Gaussian distributions to identify and classify non-tumor vs. tumor cells.

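# Call high-expression features for each tumor marker from the expression model
expr_features <- generate_HighExpressionFeaturesFromModel(seuratObj,
  tumor_markers = c("PDGFRA", "EGFR", "CDK4", "IGFBP2", "SOX2", "IGFBP5"), k = 3, prob = 1)
tumors <- expr_features$tumor_markers
# Prefix the first six feature rows with "HIGH_"
rownames(tumors)[1:6] <- paste0("HIGH_", rownames(tumors)[1:6])
# Zero out cells (columns) whose features sum to exactly 1, i.e. cells supported by a single feature
tumors[, which(apply(tumors, 2, sum) == 1)] <- 0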
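To make the mixture-model idea concrete, the following is a minimal sketch in base R, not the SCRAM implementation. It fits a two-component one-dimensional Gaussian mixture to synthetic expression values for a single marker and flags cells whose posterior probability of belonging to the high-expression component exceeds a cutoff. The function name `fit_gmm2`, the simulated data, the number of components and the 0.99 cutoff are all assumptions made for illustration; how the `k` and `prob` arguments of `generate_HighExpressionFeaturesFromModel` map onto such a model is not shown here.

```r
set.seed(1)
# Synthetic log-expression of one marker: 300 low-expressing (non-tumor-like)
# and 100 high-expressing (tumor-like) cells. This is simulated data only.
x <- c(rnorm(300, mean = 0.5, sd = 0.3), rnorm(100, mean = 3, sd = 0.5))

# Hypothetical helper: fit a two-component 1-D Gaussian mixture with a basic EM loop.
fit_gmm2 <- function(x, n_iter = 200, tol = 1e-8) {
  mu <- as.numeric(quantile(x, c(0.25, 0.75)))  # initial means
  sigma <- rep(sd(x), 2)                        # initial standard deviations
  w <- c(0.5, 0.5)                              # initial mixing weights
  loglik_old <- -Inf
  for (iter in seq_len(n_iter)) {
    # E-step: posterior probability of each component for each cell
    dens <- cbind(w[1] * dnorm(x, mu[1], sigma[1]),
                  w[2] * dnorm(x, mu[2], sigma[2]))
    post <- dens / rowSums(dens)
    # M-step: update weights, means and standard deviations
    nk <- colSums(post)
    w <- nk / length(x)
    mu <- colSums(post * x) / nk
    sigma <- sqrt(colSums(post * cbind(x - mu[1], x - mu[2])^2) / nk)
    # Stop once the log-likelihood no longer improves
    loglik <- sum(log(rowSums(dens)))
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(mu = mu, sigma = sigma, weights = w, posterior = post)
}

fit <- fit_gmm2(x)
high_comp <- which.max(fit$mu)                  # component with the larger mean
is_high <- fit$posterior[, high_comp] >= 0.99   # assumed posterior cutoff
table(is_high)
```

A stricter cutoff trades sensitivity for specificity: cells near the boundary between the two components are left unflagged, which is in the spirit of the stringent annotation described above.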

4.1.2 Large-Scale CNV Calls at Single-Cell Resolution

CNVs are a hallmark of tumor cells and can be used to distinguish tumor from non-tumor cells alongside, or in the absence of, expression markers. However, detecting CNVs from scRNA-seq data is inherently noisy because of factors such as drop-outs and unmatched control sets, and it requires a set of cells that are known to be tumor cells. To obtain a “clean” set of CNV calls that can provide reliable CNV-based tumor scores, we use a pure tumor pseudobulk sample.

Estimation of CNV profiles using patient-specific pure tumor pseudobulk samples. We first use our expression-based marker model from Module 1 (Section 4.1.1) to identify tumor cells. The collection of cells that are assigned as “tumor” by Module 1 is treated as a pure tumor cell cohort.
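As an illustration of the pseudobulk step, the sketch below sums raw counts over the cells flagged as tumor to obtain one aggregate profile. It uses a small synthetic count matrix; the object names `counts` and `is_tumor` are hypothetical, and the counts-per-million scaling is an assumption added for the example rather than a documented SCRAM step. With a real Seurat object, the raw count matrix would first have to be extracted from the object.

```r
set.seed(42)
# Synthetic raw count matrix: 5 genes x 8 cells.
counts <- matrix(rpois(40, lambda = 5), nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("cell", 1:8)))

# Hypothetical Module 1 output: TRUE for cells annotated as tumor.
is_tumor <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

# Pseudobulk profile: sum the raw counts over the tumor cells, gene by gene.
pseudobulk <- rowSums(counts[, is_tumor, drop = FALSE])

# Optional counts-per-million scaling (an assumption of this sketch).
pseudobulk_cpm <- pseudobulk / sum(pseudobulk) * 1e6
```

Aggregating many cells into a single profile reduces the effect of drop-outs in individual cells, which is why the pseudobulk sample is a more stable input for CNV calling than single-cell profiles.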

CNV calling on patient-specific pure pseudobulk samples. We hypothesize that the pseudobulk sample contains a representative set of CNVs with high probability and is therefore useful for identifying a clean CNV call-set. CNV calling on the pseudobulk samples is performed for each patient using our CNV calling algorithm, CaSpER. The CaSpER CNV calls are used as the ground-truth large-scale CNV calls for each patient.

Genotyping of CNVs on all cells. After CNVs are identified from the pseudobulk sample, we genotype the set of CNVs on all cells and generate a binary matrix that represents the existence of CNVs on the cells, i.e., CNV_(i,j).
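The shape of the resulting binary matrix can be sketched as follows. This is a toy example under stated assumptions, not the SCRAM or CaSpER genotyping procedure: the function `genotype_cnvs`, the `cnv_calls` list, the use of a mean centered-expression score per region and the 0.5 threshold are all hypothetical. The orientation (rows are CNVs, columns are cells) is likewise a choice made for this sketch.

```r
set.seed(7)
# Synthetic centered expression values: 60 genes x 10 cells.
expr <- matrix(rnorm(600, sd = 0.2), nrow = 60,
               dimnames = list(paste0("gene", 1:60), paste0("cell", 1:10)))

# Hypothetical pseudobulk CNV calls: a gene set and a direction for each event.
cnv_calls <- list(
  chr7_amp  = list(genes = paste0("gene", 1:20),  type = "amp"),
  chr10_del = list(genes = paste0("gene", 31:50), type = "del")
)

# Simulate the first four cells carrying both events.
expr[cnv_calls$chr7_amp$genes, 1:4]  <- expr[cnv_calls$chr7_amp$genes, 1:4] + 1
expr[cnv_calls$chr10_del$genes, 1:4] <- expr[cnv_calls$chr10_del$genes, 1:4] - 1

# Hypothetical genotyping rule: a cell carries a CNV when its mean signal over
# the region shifts in the expected direction beyond the threshold.
genotype_cnvs <- function(expr, cnv_calls, threshold = 0.5) {
  geno <- sapply(cnv_calls, function(cnv) {
    score <- colMeans(expr[cnv$genes, , drop = FALSE])
    if (cnv$type == "amp") as.integer(score > threshold)
    else as.integer(score < -threshold)
  })
  geno <- t(geno)                    # rows = CNVs, columns = cells
  colnames(geno) <- colnames(expr)
  geno
}

cnv_matrix <- genotype_cnvs(expr, cnv_calls)
cnv_matrix
```

Restricting genotyping to events already called on the pseudobulk sample means each cell is only tested for a small, patient-specific set of CNVs, rather than being subjected to de novo calling on sparse single-cell data.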

4.1.3 SNV Calls with XCAVATR

4.2 Annotating Non-Tumor (Host) Cells

Human and mouse glioma cell-type markers have been manually curated for the SCRAM package.

The markers can be loaded from the package as follows:

library("scram")
data(markers_human)
data(markers_mouse)

If you would like to define your own markers, you can create an Excel file using this Excel file as a template.

4.3 Summarizing co-occurring cell types using maximum frequent gene set identification

4.4 Visualization of the results

Session info

Here is the output of sessionInfo() on the system on which this document was compiled running pandoc 2.14.0.3:

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.20.2
## 
## loaded via a namespace (and not attached):
##  [1] bookdown_0.30       digest_0.6.30       R6_2.5.1           
##  [4] jsonlite_1.8.3      magrittr_2.0.3      evaluate_0.18      
##  [7] stringi_1.7.8       cachem_1.0.6        rlang_1.0.6        
## [10] cli_3.4.1           rstudioapi_0.14     jquerylib_0.1.4    
## [13] bslib_0.4.1         rmarkdown_2.18      tools_4.1.1        
## [16] stringr_1.4.1       xfun_0.34           yaml_2.3.6         
## [19] fastmap_1.1.0       compiler_4.1.1      BiocManager_1.30.19
## [22] htmltools_0.5.3     knitr_1.40          sass_0.4.2