TCGA GBM Data

Identifying CNV events

The expression values should be normalized using TPM or FPKM. Below is an example normalized gene expression input :

library(CaSpER)
#Download rdata from: https://www.dropbox.com/s/thx3jt589kps886/tcga_gbm.rda?dl=0
load("tcga_gbm.rda")
data <- tcga_gbm$data
loh <-  tcga_gbm$loh
loh.name.mapping <-  tcga_gbm$loh.name.mapping
control.sample.ids <-  tcga_gbm$control.sample.ids
genoMat <-  tcga_gbm$genoMat

Normalized gene expression matrix rows represents genes either with ensembl gene id or hgnc gene symbol. Columns represent samples or cells.

Cytoband information can be downloaded from UCSC.

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz

data(hg19_cytoband)

Centromere information can be downloaded from UCSC.

#curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz" | gunzip -c | grep acen | head

Annotation data which contains the positions of each gene along each chromosome in the genome can be generated using the following functions:

annotation <- generateAnnotation(id_type="hgnc_symbol", genes=rownames(data), ishg19=T, centromere)
data <- data[match( annotation$Gene,rownames(data)), ]

BAFExtract output files are read using the following functions:

loh <- readBAFExtractOutput ( path="./tcga_gbm_baf\\", sequencing.type="bulk")
names(loh) <- gsub(".snp", "", names(loh))

Identifying CNV events:

The main functions you will need to use to identify CNV events are CreateCasperObject() and runCaSpER(casper_object).

1. Creating casper object

The casper object is required for performing CNV analysis on single-cell and bulk RNA-Seq. Casper object is created using following functions

object <- CreateCasperObject(raw.data=data, loh.name.mapping=loh.name.mapping, 
    sequencing.type="bulk", cnv.scale=3, loh.scale=3,method="iterative",
              annotation=annotation, loh=loh, 
              control.sample.ids=control.sample.ids, cytoband=cytoband)

Above CaSpER object stores all information associated with the dataset, including expression data, smoothed expression data in n (default:3) different scales, original baf signal, smoothed baf signal, annotations, control sample ids.

loh.name.mapping: contains the sample id mappings between expression signal and baf signal. For bulk sequencing data there will be one to one mapping since baf signal is generated for all samples seperately.

2. Pairwise comparison of scales from BAF and expression signals

After creating the casper object we perform HMM on all scales of expression signal and integrate HMM segment states with Allele Frequency Shift Information. CaSpER algorithm outputs CNV calls using all the pairwise comparisons of expression signal and BAF signal decompositions.

final.objects <- runCaSpER(object, removeCentromere=T, cytoband=cytoband, method="iterative")

3. Harmonization and Summarization of CNV calls from multiple scales and from multiple pairwise comparison of BAF and Expression Signals

The pairwise comparison and assignment of CNV calls generates a large number of per-scale information that must be summarized such that each position of the genome is assigned a final call about its CNV status, i.e., deletion/amplification/neutral. We use a consistency-based approach for harmonizing the pairwise comparisons: The events are put together and we assign the final CNV for a gene or large-scale event if the CNV calls are consistent among at least a certain number of pairwise scale comparisons. CaSpER harmonizes and summarizes the CNV calls by dividing them into large-scale, gene-based, and segment-based CNV calls

Large-Scale CNV Summarization.

We assign a large-scale CNV call to every chromosome arm for each of the N×M pairwise scale comparisons. Next, for each chromosome arm, we ask whether the large-scale CNV call is consistent among at least ?? of the N×M large-scale CNV calls. N denotes the index for the highest smoothing scale for expression signal. M denotes the index for the highest smoothing scale for baf signal. thr represents minimum percentage, 75% (at least 7 out of 9 scales), of consistent CNV calls (Out of N×M comparisons of expression scales and BAF scales) while assigning a final CNV (amp/del/neutral) call to a segment/gene/chromosome arm.

finalChrMat <- extractLargeScaleEvents (final.objects, thr=0.75)

Segment based CNV Summarization.

The segments-based summarization aims at generating a final set of CNV calls for a final set of segments that are computed by comparison of scales. We first compare the segments from different expression scales and generate the consistent set of segments. For each segment in the final set, if there are more than \(\gamma\) (default=6) consistent CNV calls among N×M CNV calls, we assign the consistent CNV call to segment. When there is no consistency among the calls, we assign a neutral CNV state to segment.

gamma <- 7
all.segments <- do.call(rbind, lapply(final.objects, function(x) x@segments))
segment.summary <- extractSegmentSummary (final.objects)
loss <- segment.summary$all.summary.loss
gain <- segment.summary$all.summary.gain
loss.final <- loss[loss$count>=gamma, ]
gain.final <- gain[gain$count>=gamma, ]

Calculate large scale event accuracy using snp genotyping array as gold standard.

genoMat <-  tcga_gbm$genoMat
common <- intersect(rownames(finalChrMat), rownames(genoMat))
finalChrMat <- finalChrMat[match(common, rownames(finalChrMat)), ]
genoMat <- genoMat[match(common, rownames(genoMat)), ]
calcROC(chrMat=finalChrMat, chrMat2=genoMat)

Gene based CNV Summarization.

Similar to the large-scale summarization, we generate a matrix where rows are the samples (cells) and columns are the genes. The matrix entry of 0 corresponds to no alteration, 1 corresponds to amplification and -1 corresponds to deletion. If an alteration is consistent in more than \(\gamma\) scale comparisons (out of N×M comparisons), we report that alteration event for that sample.

all.summary<- rbind(loss.final, gain.final)
colnames(all.summary) [2:4] <- c("Chromosome", "Start",   "End")
rna <-  GRanges(seqnames = Rle(gsub("q", "", gsub("p", "", all.summary$Chromosome))), 
    IRanges(all.summary$Start, all.summary$End))   
ann.gr <- makeGRangesFromDataFrame(final.objects[[1]]@annotation.filt, keep.extra.columns = TRUE, seqnames.field="Chr")
hits <- findOverlaps(geno.rna, ann.gr)
genes <- splitByOverlap(ann.gr, geno.rna, "GeneSymbol")
genes.ann <- lapply(genes, function(x) x[!(x=="")])
all.genes <- unique(final.objects[[1]]@annotation.filt[,2])
all.samples <- unique(as.character(final.objects[[1]]@segments$ID))
rna.matrix <- gene.matrix(seg=all.summary, all.genes=all.genes, all.samples=all.samples, genes.ann=genes.ann)

Calculate gene based TPR and FPR using genotyping array as gold standard

segments <- yale_meningioma$segments
segments$type<- segments$Type_Corrected
geno.gr <-  GRanges(seqnames = Rle(gsub("q", "", gsub("p", "", segments$chr))), IRanges(segments$start, segments$end))   
hits <- findOverlaps(geno.gr, ann.gr)
genes <- splitByOverlap(ann.gr, geno.gr, "GeneSymbol")
genes.ann <- lapply(genes, function(x) x[!(x=="")])
gt.matrix <- gene.matrix(seg=segments, all.samples=all.samples, all.genes=all.genes, genes.ann=genes.ann)
calcROC(chrMat=rna.matrix, chrMat2=gt.matrix)

4.Visualization

In order to comprehensively visualize the CNV events, CaSpER first generates visuals that combine all the samples so that user can view the events jointly over all samples and co-occurrence and mutually exclusive patterns can be visually inspected. To characterize these patterns, CaSpER performs clustering of samples and clustering of detected CNVs and generates clustering plots. We summarize these visualizations below.

plotHeatmap: Visualization of the genomewide gene expression signal plot at different smoothing scales CaSpER outputs the expression signal at different scales. In these plots, each row is a sample and the columns are the chromosomes. These can be useful for comparison of outputs from different scales using the panoramic inspection of the expression signal.

obj <- final.objects[[9]]
plotHeatmap(object=obj, fileName="heatmap.png",cnv.scale= 3, cluster_cols = F, cluster_rows = T, show_rownames = T, only_soi = T)

plotLargeScaleEvent: Visualization of the large-scale CNV events among all the samples/cells. Large scale event summarization is useful for summarizing the detected large-scale CNV events (deletions and amplifications) over multiple samples. This plot summarizes the large scale CNVs and may reveal the patterns that may otherwise be missed when data is visualized at smaller scales.

plotLargeScaleEvent (object=obj, fileName="large.scale.events.png")

plotGEAndBAFOneSample: Gene expression and BAF signal for one sample in one plot. This option generates the visualization of the gene expression signal and BAF shift signal together so that the user can jointly assess them in one page.

plotGEAndBAFOneSample (object=obj, cnv.scale=3, loh.scale=3, sample= "TCGA-02-0047-01A")

TCGA GBM Data

June 17, 2019