Quantitative DNA Sequencing for Chromosomal Aberrations

QDNAseq R Package

1: Introduction
1.1: Chromosomal Aberrations
1.2: WGS DNA Copy Number Estimation Methods
1.3: QDNAseq Package
2: Running QDNAseq
2.1: Bin Annotations
2.2: Processing BAM Files
2.3: Downstream Analyses
3: Sex Chromosome Processing
4: Discussion
2.3: Downstream Analyses
3: Sex Chromosome Processing

1: Introduction:

Scheinin et al. (2014) developed the QDNAseq pipeline to help improve the detection of DNA copy number aberrations from whole-genome sequencing (WGS). The focus of the study was to improve analysis and quantification of the challenges that are presented by WGS; this includes reference genome errors, sequence completions, repeat sequences, polymorphisms, variability in sample quality and procedure bias in relation to cancerous cells and hallmark characteristics.

1.1: Chromosomal Aberrations:

Chromosomal aberrations are defined as a change within either the structure or number of a chromosome. Most typical chromosomal aberrations are considered to be aneuploids; meaning they contain or are missing a number of chromosomes (e.g trisomy and monosomy) (Abhishek et al. (2018). Previous to WGS, Kallioniemi et al. (1992) pioneered alterations in chromosomes through detection of genome-wide comparative genomic hybridisation (CGH); this provided a platform for array-based CGH (Snijders et al. 2001); in addition to single nucleotide polymorphism (SNP) arrays (Ylstra et al. (2006).

1.2: WGS DNA Copy Number Estimation Methods

Prior to QDNAseq, there was four common methods to identify WGS DNA copy number variant (CNV) detection, 1. DOC (depth of coverage), PEM (paired end mapping), SR (split reads), AS (assembly based). Table 1 has been adapted from Teo et al. 2012 and explains what each of the four methods does. With the exception of AS, DOC, PEM and SR require mapping of the sequence reads to a known reference genome. Typically the methods usually compliment each other and detect certain types of variants. However, each method have unique variants that are specific to that approach. AS based methods utilize construction of a genome in sections from reads instead of aligning them with a reference. This allows them to have a greater sensitivity when detecting deviations from the reference genomes but require much higher sequence coverage of around 40 times that of other methods, due to this the costs are higher. SR and PEM both map sequence reads from either ends of genomic DNA molecules onto a reference genome. Both methods provide copy number and genome rearrangement data but are subject to higher sensitivity in regards to DNA integrity. Finally, DOC methods deduce copy number from sequence depth across the whole genome and does not require sequences from both ends (Scheinin et al. (2014).

*1.3: QDNAseq Package:**

QDNAseq was developed to implement profile correction and blacklisting, perform downstream seqmentation and calling of abberations using already established tools. QDNAseq utilises BAM input files as these are one of the more common file types produced by current alignment tools. QDNAseq is available at QDNAseq Bioconductor and has detailed information regarding operation and tutorial.

2: Running QDNAseq

The following code sequences are the basics of how to run the QDNAseq package. For the purposes of this run, the example data set used was chromosome 7-10 low grade glioma (LGG) sample; as per tutorial provided by Scheinin et al. (2014). The first step is to load the QDNAseq package.

library(QDNAseq)

2.1: Bin Annotations

The bin annotations are available through the QDNAseq.hg19 package which has to be installed from Bioconductor separate to QDNAseq. These are pre-calculated for genome build hg19 in sizes 1, 5, 10, 15, 30, 50, 100, 500 and 1000 kbp.

BiocManager::install("QDNAseq.hg19")
bins <- getBinAnnotations(binSize=15)
bins

## QDNAseq bin annotations for Hsapiens, build hg19.
## Created by Ilari Scheinin with QDNAseq 0.7.5, 2014-02-06 11:48:04.
## An object of class 'AnnotatedDataFrame'
##   rowNames: 1:1-15000 1:15001-30000 ... Y:59370001-59373566 (206391
##     total)
##   varLabels: chromosome start ... use (9 total)
##   varMetadata: labelDescription

2.2: Processing Bam Files

Next, the sequencing data from BAM files need to be loaded. This will produce an object of class QDNAseqReadCounts (see below). For multiple use of same BAM files use option cache=TRUE to cache intermediate files to speed up future analysis.

Obtaining Data

LGG150 test data was retrieved from LGG150 Data File

data(LGG150)
readCounts <- (LGG150)
readCounts

## QDNAseqReadCounts (storageMode: lockedEnvironment)
## assayData: 38819 features, 1 samples 
##   element names: counts 
## protocolData: none
## phenoData
##   sampleNames: LGG150
##   varLabels: name reads used.reads expected.variance
##   varMetadata: labelDescription
## featureData
##   featureNames: 7:1-15000 7:15001-30000 ... 10:135525001-135534747
##     (38819 total)
##   fvarLabels: chromosome start ... use (9 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

Read Count Plotting

A raw copy number profile plot is produced highlighting the bins that will be removed (highlighted Red).

plot(readCounts, logTransform=FALSE, ylim=c(-50, 200), main="Figure 1: LGG150 Read Counts per Bins", cex.main = 1, font.main = 4)

## Plotting sample Figure 1: LGG150 Read Counts per Bins (1 of 1) ...

highlightFilters(readCounts, logTransform=FALSE,
                   residual=TRUE, blacklist=TRUE)

## Highlighted 3,375 bins.

Median Read Count Plotting

Applying filters and plotting the median read counts as a function of GC content and mappability. The distribution appears less smooth than what is expected from an entire genome due to only containing a subset of chromosomes.

readCountsFiltered <- applyFilters(readCounts, residual=TRUE, blacklist=TRUE)

## 38,819   total bins
## 38,819   of which in selected chromosomes
## 36,722   of which with reference sequence
## 33,347   final bins

isobarPlot(readCountsFiltered, main="Figure 2: LGG150 Median Read Counts", cex.main = 1, font.main = 4)

## Plotting sample Figure 2: LGG150 Median Read Counts

Read Count Noise Plotting

(Below) Estimation for correction for GC content and mappability and plotting the relationship between the observed standard deviation and read depth.

readCountsFiltered <- estimateCorrection(readCountsFiltered)

## Calculating correction for GC content and mappability

##     Calculating fit for sample LGG150 (1 of 1) ...

## Done.

noisePlot(readCountsFiltered, main="Figure 3: LGG150 Read Count Relationship Between Sequence Depth and Noise", cex.main = 1, font.main = 4)

Copy Number Filtering

Applying correction for GC content and mappability. QDNAseqCopyNumbers object will be produced which is then normalized, smooth outliers and plot copy number profile.

copyNumbers <- correctBins(readCountsFiltered)

copyNumbers

## QDNAseqCopyNumbers (storageMode: lockedEnvironment)
## assayData: 38819 features, 1 samples 
##   element names: copynumber 
## protocolData: none
## phenoData
##   sampleNames: LGG150
##   varLabels: name reads ... loess.family (6 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 7:1-15000 7:15001-30000 ... 10:135525001-135534747
##     (38819 total)
##   fvarLabels: chromosome start ... use (9 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

copyNumbersNormalized <- normalizeBins(copyNumbers)

## Applying median normalization ...

copyNumbersSmooth <- smoothOutlierBins(copyNumbersNormalized)

## Smoothing outliers ...

plot(copyNumbersSmooth, main="Figure 4: LGG150 Copy number profile after correctiions", cex.main = 1, font.main = 4)

## Plotting sample Figure 4: LGG150 Copy number profile after correctiions (1 of 1) ...

Exporting Filtered Read Counts

The data is ready to be analyzed with a downstream package of choice. For external visualisation the data can be exported to specific file formats (IGV analysis below).

exportBins(copyNumbersSmooth, file="LGG150.txt")

## [1] "LGG150.txt"

exportBins(copyNumbersSmooth, file="LGG150.igv", format="igv")

## [1] "LGG150.igv"

exportBins(copyNumbersSmooth, file="LGG150.bed", format="bed")

## [1] "LGG150.bed"

2.3: Downstream Analysis

Read Count Segmenting

copyNumbersSegmented <- segmentBins(copyNumbersSmooth, transformFun="sqrt")

## Performing segmentation:

##     Segmenting: LGG150 (1 of 1) ...

copyNumbersSegmented <- normalizeSegmentedBins(copyNumbersSegmented)

Segmented Read Count Plot

Segmentation with the CBS algorithm from DNAcopy, and calling copy number aberrations with CGHcall or cutoffs have been implemented for convenience. By default, segmentation uses a log2 -transformation, but a sqrt(x + 3/8) can also be used as it stabilizes the variance of a Poisson distribution

plot(copyNumbersSegmented, main="Figure 5: LGG150 Copy number profile after Segmenting", cex.main = 1, font.main = 4)

## Plotting sample Figure 5: LGG150 Copy number profile after Segmenting (1 of 1) ...

Called Read Count Plot

Tune segmentation parameters and call aberrations, then final results can be plotted.

copyNumbersCalled <- callBins(copyNumbersSegmented)

## [1] "Total number of segments present in the data: 14"
## [1] "Number of segments used for fitting the model: 11"

plot(copyNumbersCalled, main="Figure 6: LGG150 Copy number profile after calling gains and losses", cex.main = 1, font.main = 4)

Called Count Extraction to VCF and SEG Files

Called data can be exported as VCF file or SEG for further downstream analysis.

exportBins(copyNumbersCalled, "copyNumbersCalled.vcf")

## [1] "copyNumbersCalled.vcf"

exportBins(copyNumbersCalled, "copyNumbersCalled.seg")

## [1] "copyNumbersCalled.seg"

CGHcall Conversion

For other downstream analyses, such as running CGHregions, conversion to a cghCall object can be useful. This command can also be used to generate cghRAW and cghSeq objects by running it before segmentation or calling.

cgh <- makeCgh(copyNumbersCalled)

cgh

## cghCall (storageMode: lockedEnvironment)
## assayData: 33347 features, 1 samples 
##   element names: calls, copynumber, probamp, probdloss, probgain, probloss, probnorm, segmented 
## protocolData: none
## phenoData
##   sampleNames: LGG150
##   varLabels: name reads ... loess.family (6 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 7:45001-60000 7:60001-75000 ... 10:135420001-135435000
##     (33347 total)
##   fvarLabels: Chromosome Start ... use (9 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

2.4 Parallel Comparison

Parallel Computation

QDNAseq can allow for parallel computing using the future package. In order to do this and appropriate plan must be selected. QDNAseq currently includes estimateCorrection(), segmentBins(), createBins() and calculateBlacklist() for parallel processing. It also includes binReadcounts() but this only parallelises by chromosome when chunkSize is used. The default argument method=“CGHcall” can be used for parallel computation using the function callBins() or CGHcall(). However the number of processes to use needs to be specified with the argument ncpus.

Non-Parallel Processing

The Default is to use single-core processing via “sequential” futures. This is set by:

future::plan("sequential")

Parallel Processing a Current Machine and Adhoc Machine

In order to process data in parallel using multiple processes on a current machine (see Below). after, all functions supporting parallel processing will automatically use it. However with no restrictions set, the default is to use all cores available. to set the number of parallel workers, use argument workers (example below).

future::plan("multisession")

future::plan("multisession", workers=4)

Connecting to an Adhoc machine using multiple R sessions the following code or similar should be used:

cl <- future::makeClusterPSOCK(...)
future::plan("cluster", cluster=cl)

3: Sex Chromosome Processing

QDNAseq automatically ignores sex chromosomes by default. in order for them to be included in the analysis, the function applyFilters() should be run with the argument chromosomes=NA (includes both X and Y) or chromsomes=“Y”/chromsomes=“X” to include X or Y respectively.
This would also affect the calculation of LOESS and should be counteracted by using the estimateCorrection() function. The process should be: Filter Sex Chromosomes, run estimateCorrection() and reverse the sex chromosome filtering.

Sex Chromosome QDNAseq

readCounts <- binReadCounts(getBinAnnotations(15))  
readCounts <- applyFilters(readCounts)  
readCounts <- estimateCorrection(readCounts)  
readCounts <- applyFilters(readCounts, chromosomes=NA)  
copyNumbers <- correctBins(readCounts)

QDNAseq

Michael Bell, 15002484