Plates 1 and 2 have been processed with the nf-co/rnaseq pipeline.
GRCh38
Briefly, Nextflow is a workflow language. It provides a framework to link together the input and output of a “workflow” length set of software.
https://nf-co.re is a highly curated repository of bioinformatics pipelines that is actively supported and maintained by a large group of professional bioinformaticians in both the US and Europe. Critically, there are no redundancies. For example, there is only one rnaseq pipeline. The explicit goal of the organization is to produce a best-practices based, standardized pipeline that maximizes reproducibility.
If someone in Denmark wants to reproduce our results on AWS, or any number of other cluster varieties, all we have to do is tell them what genome we used, and what version of the pipeline. If they have installed Nextflow and Singularity, then they could quite literally copy/paste our command into their system (with the addition of the correct cluster scheduler flag), and the same results that we generated would be output, from the same software versions, without any more work on the user’s part.
The PCA plots show evidence of dissimilar samples. Subsetting these samples out and examining them revealed that they are reasonably considered outliers in the percent of their libraries aligned to intergenic regions. As a result, we propose that if a library is more than 8% intergenic, it be flagged for re-sequencing.
The nf-co/rnaseq QC output is compiled by MultiQC into a nice report. You may view these here:
And compare to the MGI qc output here:
Common QC metrics for RNAseq are library size, the percent of the library that uniquely maps to the genome, and rRNA as a percentage of the library size. While we might set a threshold for filtering on any one of these metrics, all of our libraries performed well.
Another common metric is produced by quantifying library complexity, or how much of the library is made up of duplicated sequences. This is complicated by the fact that we expect highly expressed sequences to have higher duplication rates. dupRadar addresses this by fitting a a curve to the duplication rate as a function of gene expression. As a result, the intercept may be used as a measure for how much of the library is duplicated at low expression levels. The higher the intercept, the higher the rate of duplication among lowly expressed reads. From this, we might surmise that the library is of low complexity.
Note: another method is picard’s estimateLibraryComplexity (not calculated by this pipeline).
note: this is equivalent to RseQC expression profiling efficiency
Percent Exonic Reads is defined as the ratio of exon mapped reads to (total) mapped reads. This metric does not account for duplicates or rRNA/ncRNA.
The plots below are density plots of the log2cpm with different filtered applied to the counts matrix in order to filter out low expression genes. Lines and colors correspond to samples.
The Mid Stringency Filter has been chosen as a preliminary filter.
31804 out of 60649 genes retained
22447 out of 60649 genes retained
15191 out of 60649 genes retained
There are a few methods to remove known batch effects from RNAseq data, and none that have been definitively chosen as superior.
One method generally employed is to include any batch parameters in the design formula (generally for DESeq, though the same method works for edgeR and limma:voom), eg ~plate+age+interesting_parameter. The effects of interesting_parameter (what is reconsistituted as the log2FoldChange) are then free of the effects which may be ascribed to plate and age. Additionally, the plate and age coefficients and a dummy variable coded model matrix may be used to remove the effects from the normalized counts, if so wished.
Another method, theoretically similar to that above, is to use a popular package called sva. This has been around for some time, and has been continually added to/improved. In 2020, a paper was published describing a method, combat_seq(), which uses the same probabilistic model as DESeq()/(default)edgeR() in order to infer and then remove a batch parameter. Of importance, the output is rounded to integers, so the data may be used by downstream applications which expect count/integer data.