10X data

Identifying CNV events

We first perform gene expression normalization for 10x data. Expression data is downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110499

library(Seurat)
library(CaSpER)

counts  <- read.delim("GSE110499_GEO_processed_MM_10X_raw_UMI_count_martix.txt", stringsAsFactor=F, header=T)
rownames(counts) <- counts[, 1] 
counts <- counts[, -1]

mm135 <- CreateSeuratObject(counts = counts, project = "mm135", min.cells = 3, min.features = 200)
mm135[["percent.mt"]] <- PercentageFeatureSet(mm135, pattern = "^MT-")
mm135 <- subset(mm135, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
mm135 <- NormalizeData(mm135 , scale.factor = 1e6, normalization.method = "RC")
mm135 <- FindVariableFeatures(mm135, do.plot = T, nfeatures = 1000)
mm135 <- ScaleData(mm135)

mm135 <- RunPCA(mm135, features = VariableFeatures(object = mm135),npcs = 100)
mm135 <- RunTSNE(mm135, dims.use = 1:10)

Some samples in this dataset are normal. So in order to seperate normal cells from MM cells we perform clustering.

mm135 <- FindNeighbors(mm135, dims = 1:10)
mm135 <- FindClusters(mm135, resolution = 0.5)

log.ge <- as.matrix(mm135@assays$RNA@data)
control <- names(Idents(mm135) )[Idents(mm135) %in% c(2,7)]
mm <- names(Idents(mm135) )[Idents(mm135) %in% c(0, 1, 3, 4)]

genes <- rownames(log.ge)
annotation <- generateAnnotation(id_type="hgnc_symbol", genes=genes, centromere=centromere, ishg19 = T)
log.ge <- log.ge[match( annotation$Gene,rownames(log.ge)) , ]
rownames(log.ge) <- annotation$Gene
log.ge <- log2(log.ge +1)

BAF values are calculated using BAFExtract and can be downloaded from this link: https://github.com/akdess/CaSpER/blob/master/data/maf.rda.

load("maf.rda")
loh<- list()
loh[[1]] <- maf
names(loh) <- "MM135"
loh.name.mapping <- data.frame (loh.name= "MM135" , sample.name=colnames(log.ge))

The main functions you will need to use to identify CNV events are CreateCasperObject() and runCaSpER(casper_object).

1. Creating casper object

The casper object is required for performing CNV analysis on single-cell and bulk RNA-Seq. Casper object is created using following functions

object <- CreateCasperObject(raw.data=log.ge,loh.name.mapping=loh.name.mapping, sequencing.type="single-cell", 
cnv.scale=3, loh.scale=3, 
expr.cutoff=0.1, 
annotation=annotation, method="iterative", loh=loh, 
control.sample.ids=control, cytoband=cytoband)

Above CaSpER object stores all information associated with the dataset, including expression data, smoothed expression data in n (default:3) different scales, original baf signal, smoothed baf signal, annotations, control sample ids.

2. Pairwise comparison of scales from BAF and expression signals

After creating the casper object we perform HMM on all scales of expression signal and integrate HMM segment states with Allele Frequency Shift Information. CaSpER algorithm outputs CNV calls using all the pairwise comparisons of expression signal and BAF signal decompositions.

final.objects <- runCaSpER(object, removeCentromere=T, cytoband=cytoband, method="iterative")

3. Harmonization and Summarization of CNV calls from multiple scales and from multiple pairwise comparison of BAF and Expression Signals

The pairwise comparison and assignment of CNV calls generates a large number of per-scale information that must be summarized such that each position of the genome is assigned a final call about its CNV status, i.e., deletion/amplification/neutral. We use a consistency-based approach for harmonizing the pairwise comparisons: The events are put together and we assign the final CNV for a gene or large-scale event if the CNV calls are consistent among at least a certain number of pairwise scale comparisons. CaSpER harmonizes and summarizes the CNV calls by dividing them into large-scale, gene-based, and segment-based CNV calls

Large-Scale CNV Summarization.

We assign a large-scale CNV call to every chromosome arm for each of the N×M pairwise scale comparisons. Next, for each chromosome arm, we ask whether the large-scale CNV call is consistent among at least \(\gamma\) of the N×M large-scale CNV calls. N denotes the index for the highest smoothing scale for expression signal. M denotes the index for the highest smoothing scale for baf signal. thr represents minimum percentage, 75% (at least 7 out of 9 scales), of consistent CNV calls (Out of N×M comparisons of expression scales and BAF scales) while assigning a final CNV (amp/del/neutral) call to a segment/gene/chromosome arm.

chrMat <- extractLargeScaleEvents (final.objects, thr=0.75) 

chrMat <- finalChrMat
plot.data <- melt(chrMat)
plot.data$value2 <- "neutral"
plot.data$value2[plot.data$value > 0] <- "amplification"
plot.data$value2[plot.data$value < 0] <- "deletion"
plot.data$value2 <- factor(plot.data$value2, levels = c("amplification", 
    "deletion", "neutral"))
plot.data$X2 <- factor(plot.data$X2, levels = colnames(chrMat))
p <- ggplot(aes(x = X2, y = X1, fill = value2), data = plot.data) + 
    geom_tile(colour = "white", size = 0.01) + 
    labs(x = "", 
    y = "") + scale_fill_manual(values = c(amplification = muted("red"), 
    deletion = muted("blue"), neutral = "white")) + theme_grey(base_size = 6) + 
    theme(legend.position = "right", legend.direction = "vertical", 
        legend.title = element_blank(), strip.text.x = element_blank(), 
        legend.text = element_text(colour = "black", size = 7, 
            face = "bold"), legend.key.height = grid::unit(0.8, 
            "cm"), legend.key.width = grid::unit(0.5, "cm"), 
        axis.text.x = element_text(size = 5, colour = "black", 
            angle = -45, hjust = 0), axis.text.y = element_text(size = 6, 
            vjust = 0.2, colour = "black"), axis.ticks = element_line(size = 0.4), 
        plot.title = element_text(colour = "black", hjust = 0, 
            size = 6, face = "bold"))

10X data

September 23, 2019