Identifying CNV events
We first perform gene expression normalization for 10x data. Expression data is downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110499
library(Seurat)
library(CaSpER)
counts <- read.delim("GSE110499_GEO_processed_MM_10X_raw_UMI_count_martix.txt", stringsAsFactor=F, header=T)
rownames(counts) <- counts[, 1]
counts <- counts[, -1]
mm135 <- CreateSeuratObject(counts = counts, project = "mm135", min.cells = 3, min.features = 200)
mm135[["percent.mt"]] <- PercentageFeatureSet(mm135, pattern = "^MT-")
mm135 <- subset(mm135, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
mm135 <- NormalizeData(mm135 , scale.factor = 1e6, normalization.method = "RC")
mm135 <- FindVariableFeatures(mm135, do.plot = T, nfeatures = 1000)
mm135 <- ScaleData(mm135)
mm135 <- RunPCA(mm135, features = VariableFeatures(object = mm135),npcs = 100)
mm135 <- RunTSNE(mm135, dims.use = 1:10)
Some samples in this dataset are normal. So in order to seperate normal cells from MM cells we perform clustering.
mm135 <- FindNeighbors(mm135, dims = 1:10)
mm135 <- FindClusters(mm135, resolution = 0.5)
log.ge <- as.matrix(mm135@assays$RNA@data)
control <- names(Idents(mm135) )[Idents(mm135) %in% c(2,7)]
mm <- names(Idents(mm135) )[Idents(mm135) %in% c(0, 1, 3, 4)]
genes <- rownames(log.ge)
annotation <- generateAnnotation(id_type="hgnc_symbol", genes=genes, centromere=centromere, ishg19 = T)
log.ge <- log.ge[match( annotation$Gene,rownames(log.ge)) , ]
rownames(log.ge) <- annotation$Gene
log.ge <- log2(log.ge +1)
BAF values are calculated using BAFExtract and can be downloaded from this link: https://github.com/akdess/CaSpER/blob/master/data/maf.rda.
load("maf.rda")
loh<- list()
loh[[1]] <- maf
names(loh) <- "MM135"
loh.name.mapping <- data.frame (loh.name= "MM135" , sample.name=colnames(log.ge))
The main functions you will need to use to identify CNV events are CreateCasperObject() and runCaSpER(casper_object).
1. Creating casper object
The casper object is required for performing CNV analysis on single-cell and bulk RNA-Seq. Casper object is created using following functions
object <- CreateCasperObject(raw.data=log.ge,loh.name.mapping=loh.name.mapping, sequencing.type="single-cell",
cnv.scale=3, loh.scale=3,
expr.cutoff=0.1,
annotation=annotation, method="iterative", loh=loh,
control.sample.ids=control, cytoband=cytoband)
Above CaSpER object stores all information associated with the dataset, including expression data, smoothed expression data in n (default:3) different scales, original baf signal, smoothed baf signal, annotations, control sample ids.
2. Pairwise comparison of scales from BAF and expression signals
After creating the casper object we perform HMM on all scales of expression signal and integrate HMM segment states with Allele Frequency Shift Information. CaSpER algorithm outputs CNV calls using all the pairwise comparisons of expression signal and BAF signal decompositions.
final.objects <- runCaSpER(object, removeCentromere=T, cytoband=cytoband, method="iterative")
3. Harmonization and Summarization of CNV calls from multiple scales and from multiple pairwise comparison of BAF and Expression Signals
The pairwise comparison and assignment of CNV calls generates a large number of per-scale information that must be summarized such that each position of the genome is assigned a final call about its CNV status, i.e., deletion/amplification/neutral. We use a consistency-based approach for harmonizing the pairwise comparisons: The events are put together and we assign the final CNV for a gene or large-scale event if the CNV calls are consistent among at least a certain number of pairwise scale comparisons. CaSpER harmonizes and summarizes the CNV calls by dividing them into large-scale, gene-based, and segment-based CNV calls
We assign a large-scale CNV call to every chromosome arm for each of the N×M pairwise scale comparisons. Next, for each chromosome arm, we ask whether the large-scale CNV call is consistent among at least \(\gamma\) of the N×M large-scale CNV calls. N denotes the index for the highest smoothing scale for expression signal. M denotes the index for the highest smoothing scale for baf signal. thr represents minimum percentage, 75% (at least 7 out of 9 scales), of consistent CNV calls (Out of N×M comparisons of expression scales and BAF scales) while assigning a final CNV (amp/del/neutral) call to a segment/gene/chromosome arm.
chrMat <- extractLargeScaleEvents (final.objects, thr=0.75)
chrMat <- finalChrMat
plot.data <- melt(chrMat)
plot.data$value2 <- "neutral"
plot.data$value2[plot.data$value > 0] <- "amplification"
plot.data$value2[plot.data$value < 0] <- "deletion"
plot.data$value2 <- factor(plot.data$value2, levels = c("amplification",
"deletion", "neutral"))
plot.data$X2 <- factor(plot.data$X2, levels = colnames(chrMat))
p <- ggplot(aes(x = X2, y = X1, fill = value2), data = plot.data) +
geom_tile(colour = "white", size = 0.01) +
labs(x = "",
y = "") + scale_fill_manual(values = c(amplification = muted("red"),
deletion = muted("blue"), neutral = "white")) + theme_grey(base_size = 6) +
theme(legend.position = "right", legend.direction = "vertical",
legend.title = element_blank(), strip.text.x = element_blank(),
legend.text = element_text(colour = "black", size = 7,
face = "bold"), legend.key.height = grid::unit(0.8,
"cm"), legend.key.width = grid::unit(0.5, "cm"),
axis.text.x = element_text(size = 5, colour = "black",
angle = -45, hjust = 0), axis.text.y = element_text(size = 6,
vjust = 0.2, colour = "black"), axis.ticks = element_line(size = 0.4),
plot.title = element_text(colour = "black", hjust = 0,
size = 6, face = "bold"))