Initial LLFS RNAseq QC

Summary

Processing Summary

Processing Pipeline

Plates 1 and 2 have been processed with the nf-co/rnaseq pipeline.

Genome

GRCh38

Annotations:

Gencode Release 38

note: MGI used gencode release 35. The most current is 38.

Why did you use the nf-co/rnaseq pipeline

Briefly, Nextflow is a workflow language. It provides a framework to link together the input and output of a “workflow” length set of software.

https://nf-co.re is a highly curated repository of bioinformatics pipelines that is actively supported and maintained by a large group of professional bioinformaticians in both the US and Europe. Critically, there are no redundancies. For example, there is only one rnaseq pipeline. The explicit goal of the organization is to produce a best-practices based, standardized pipeline that maximizes reproducibility.

Benefit of nf-co vs other options

If someone in Denmark wants to reproduce our results on AWS, or any number of other cluster varieties, all we have to do is tell them what genome we used, and what version of the pipeline. If they have installed Nextflow and Singularity, then they could quite literally copy/paste our command into their system (with the addition of the correct cluster scheduler flag), and the same results that we generated would be output, from the same software versions, without any more work on the user’s part.

QC Summary

The PCA plots show evidence of dissimilar samples. Subsetting these samples out and examining them revealed that they are reasonably considered outliers in the percent of their libraries aligned to intergenic regions. As a result, we propose that if a library is more than 8% intergenic, it be flagged for re-sequencing.

Current thresholds

Samples violating these thresholds

Justification

QC

Detailed QC metrics by plate

The nf-co/rnaseq QC output is compiled by MultiQC into a nice report. You may view these here:

plate1

plate2

And compare to the MGI qc output here:

MGI plate1

MGI plate2

Alignment metrics

Common QC metrics for RNAseq are library size, the percent of the library that uniquely maps to the genome, and rRNA as a percentage of the library size. While we might set a threshold for filtering on any one of these metrics, all of our libraries performed well.

Unique Alignment

rRNA percent

Intergenic Percent

Complexity metrics

dupRadar Intercept

Another common metric is produced by quantifying library complexity, or how much of the library is made up of duplicated sequences. This is complicated by the fact that we expect highly expressed sequences to have higher duplication rates. dupRadar addresses this by fitting a a curve to the duplication rate as a function of gene expression. As a result, the intercept may be used as a measure for how much of the library is duplicated at low expression levels. The higher the intercept, the higher the rate of duplication among lowly expressed reads. From this, we might surmise that the library is of low complexity.

Note: another method is picard’s estimateLibraryComplexity (not calculated by this pipeline).

Percent Exonic Reads

note: this is equivalent to RseQC expression profiling efficiency

Percent Exonic Reads is defined as the ratio of exon mapped reads to (total) mapped reads. This metric does not account for duplicates or rRNA/ncRNA.

Scatter plots

Count Filtering

The plots below are density plots of the log2cpm with different filtered applied to the counts matrix in order to filter out low expression genes. Lines and colors correspond to samples.

The Mid Stringency Filter has been chosen as a preliminary filter.

Unfiltered counts

Minimal filter

Filter parameter: more than 1 cpm in at least 2 samples

31804 out of 60649 genes retained

Mid stringency filter

Filter parameter: more than 2 cpm in at least 4 samples

22447 out of 60649 genes retained

Stringent filter

Filter parameter: more than 4 cpm in at least 8 samples

15191 out of 60649 genes retained

Sample Agreement

plate1

PCA – Unfiltered, top 500 most variable genes

PCA – Unfiltered, all genes

PCA – filtered, top 500 most variable genes

PCA – filtered, all genes

plate2

PCA – Unfiltered, top 500 most variable genes

PCA – Unfiltered, all genes

PCA – filtered, top 500 most variable genes

PCA – filtered, all genes

combined

PCA – Unfiltered, top 500 most variable genes

PCA – Unfiltered, all genes

PCA – filtered, top 500 most variable genes

PCA – filtered, all genes

Batch

There are a few methods to remove known batch effects from RNAseq data, and none that have been definitively chosen as superior.

One method generally employed is to include any batch parameters in the design formula (generally for DESeq, though the same method works for edgeR and limma:voom), eg ~plate+age+interesting_parameter. The effects of interesting_parameter (what is reconsistituted as the log2FoldChange) are then free of the effects which may be ascribed to plate and age. Additionally, the plate and age coefficients and a dummy variable coded model matrix may be used to remove the effects from the normalized counts, if so wished.

Another method, theoretically similar to that above, is to use a popular package called sva. This has been around for some time, and has been continually added to/improved. In 2020, a paper was published describing a method, combat_seq(), which uses the same probabilistic model as DESeq()/(default)edgeR() in order to infer and then remove a batch parameter. Of importance, the output is rounded to integers, so the data may be used by downstream applications which expect count/integer data.