Tutorial experiment downloadable here:
+ Six specimens for two experimental conditions
+ single-end mode sequencing,
+ 1 million reads for each sample.
Experiment description:
+ Six blood circulating exosomes miRNA samples from healthy donors (hd) and six blood circulating exosomes miRNA samples from tumor patients (tum)
+ from 1 to 3 hd and tum samples were harvested on day 1, from 5 to 6 hd and tum samples were harvested on day 2. Thus the data are require the addition of the batch effect in differential expression analysis.
The following data are available for download:
This data set allows running all the steps required to detect differentially expressed genes. The first step is the quantification of mature annotated miRNA and the generation of counts table to be used for differential expression, i.e. this section.
The covariates and batch effects can be added to the counts table (all.counts.txt), see From samples to experiment section. This table is used for differential expression genes detection.
Subsequently the overall characteristics of the dataset can be explore via PCA, Visualizing experiment data with PCA section.
It is also possible to evaluate which is the statistical power of the experiment, i.e. identifying the fraction of genes/transcripts that can be identify giving the statistical structure of the experiment, or identify the optimal number of samples required to detect differentially expressed genes. More info in Evaluating sample size and experiment power section.
Differential expression can be then evaluate using the DESeq2 module, Differential expression analysis with DESeq2.
The miRNAseq workflow can be run using 4SeqGUI graphical interface:
The miRNAseq docker container executes the following steps:
The full workflow is described in Cordero et al. Plos ONE 2012. In brief, fastq files are trimmed using cutadapt and the trimmed reads are mapped on miRNA precursors, i.e. harpin.fa file, from miRBase using SHRIMP. Using the location of the mature miRNAs in the precursor, countOverlaps function, from the Bioconductor package GenomicRanges is used to quantify the reads mapping on mature miRNAs.
All the parameters needed to run the miRNAseq workflow can be setup using 4SeqGUI
A detailed description of the parameters is given below.
The miRNAseq workflow can be also executed using R and it is completely embedded in a unique function:
#test example
system("wget 130.192.119.59/public/test.mirnaCounts.zip")
unzip("test.mirnaCounts.zip")
setwd("test.mirnaCounts")
library(docker4seq)
mirnaCounts(group="docker",fastq.folder=getwd(), scratch.folder="/data/scratch",
mirbase.id="hsa",download.status=FALSE, adapter.type="NEB", trimmed.fastq=FALSE)
User has to create the fastq.folder, where the fastq.gz files for all miRNAs under analysis are located. The scratch.folder is the location where temporary data are created. The results will be then saved in the fastq.folder. User has to provide also the identifier of the miRBase organism, e.g. hsa for Homo sapiens, mmu for Mus musculus. If the download.status is set to FALSE, mirnaCounts uses miRBase release 21, if it is set to TRUE the lastest version of precursor and mature miRNAs will be downloaded from miRBase. Users need to provide the name of the producer of the miRNA library prep kit to identify which adapters need to be provided to cutadapt, adapter.type parameter. The available adapters are NEB and Illumina, but, upon request, we can add other adapters. Finally, if the trimmed.fastq is set to FALSE the trimmed fastq are not saved at the end of the analysis.
The miRNAseq workflow produces the following output files:
+ README: A file describing the content of the data folder
+ all.counts.txt: miRNAs raw counts, to be used for differential expression analysis
+ trimmimg.log: adapters trimming statistics
+ shrimp.log: mapping statistics
+ all.counts.Rda: miRNAs raw counts ready to be loaded in R.
+ analysis.log: logs of the full analysis pipeline
4SeqGUI provides an interface to add covariates and batches to all.counts.txt
The function mirnaCovar add to the header of all.counts.txt covariates and batches or covariates only.
#test example
system("wget 130.192.119.59/public/test.mirna.analysis.zip")
unzip("test.mirna.analysis.zip")
setwd("test.mirna.analysis")
library(docker4seq)
mirnaCovar(experiment.folder=paste(getwd(), "all.counts.txt", sep="/"),
covariates=c("Cov.1", "Cov.1", "Cov.1", "Cov.1", "Cov.1", "Cov.1",
"Cov.2", "Cov.2", "Cov.2", "Cov.2", "Cov.2", "Cov.2"),
batches=c("bath.1", "bath.1", "bath.1", "bath.2", "batch.2", "batch.2",
"batch.1", "batch.1","batch.1", "batch.2","bath.2", "bath.2"), output.folder=getwd())
The output of mirnaCovar, i.e. w_covar_batch_all.counts.txt, is compliant with PCA, Sample size estimator, Experiment stat. power and DEseq2 analysis.
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component accounts for as much of the variability in the data as possible, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. 4SeqGUI provides an interface to the generation experiment samples PCA
The plot is saved in pca.pdf in the selected folder.
#test example
system("wget 130.192.119.59/public/test.analysis.zip")
unzip("test.analysis.zip")
setwd("test.analysis")
library(docker4seq)
pca(experiment.table="_log2FPKM.txt", type="FPKM", legend.position="topleft", covariatesInNames=FALSE, principal.components=c(1,2), pdf = TRUE, output.folder=getwd())
User needs to provide the paths of experiment table, experiment.table parameter, i.e. the file generated using the samples2experiment function. The type parameter indicates if FPKM, TPM or counts are used for the PCA generation. The parameter legend.position defines where to locate the covariates legend. The parameter covariatesInNames indicates if the header of the experiment table contains or not covariate information. The parameter principal.components indicates which principal components should be plotted. output.folder indicates where to save the pca.pdf file.
The values in parentesis on x and y axes are the amount of variance explained by each principal component.
IMPORTANT: The above analysis is suitable also for miRNAseq data
Sample size estimation is an important issue in the design of RNA sequencing experiments. Furthermore, experiment power provides an indication of which is the fraction of differentially expressed genes that can be detected given a specific number of samples and differential expression detection thresholds. RnaSeqSampleSize Bioconductor package provides the possibility to calculate, from a pilot experiment, the statistical power and to define the optimal sample size. We have implemented wrapper functions to RnaSeqSampleSize sample size and experiment power estimation.
4SeqGUI provides an interface to sample size estimation and to statistical power estimation.
#test example
system("wget 130.192.119.59/public/test.analysis.zip")
unzip("test.analysis.zip")
setwd("test.analysis")
library(docker4seq)
sampleSize(group="docker", filename="_counts.txt", power=0.80, FDR=0.1, genes4dispersion=200, log2fold.change=1)
The requested parameters are the path to the counts experiment table generated by samples2experiment function. The param power indicates the expected fraction of differentially expressed gene, e.g 0.80. FDR and log2fold.change are the two thresholds used to define the set of differentially expressed genes of interest.
The output file is sample_size_evaluation.txt is saved in the R working folder, below an example of the file content:
IMPORTANT: The above analysis is suitable also for miRNAseq data
#test example
system("wget 130.192.119.59/public/test.analysis.zip")
unzip("test.analysis.zip")
setwd("test.analysis")
library(docker4seq)
experimentPower(group="docker", filename="_counts.txt",replicatesXgroup=7, FDR=0.1, genes4dispersion=200, log2fold.change=1)
The requested parameters are the path to the counts experiment table generated by samples2experiment function. The param replicatesXgroup indicates the number of sample associated with each of the two covariates. FDR and log2fold.change are the two thresholds used to define the set of differentially expressed genes of interest. genes4dispersion indicates the number of genes used in the estimation of read counts and dispersion distribution.
The output file is power_estimation.txt is saved in the R working folder, below an example of the file content:
A basic task in the analysis of count data from RNA-seq is the detection of differentially expressed genes. 4SeqGUI provides an interface to DESeq2 to simplify differential expression analysis.
The output files are:
DEfull.txt containing the full set of results generated by DESeq2
DEfiltered_log2fc_X_fdr_Y.Y.txt containing the set of differentially expressed genes passing the indicated thresholds
log2normalized_counts.txt, log2 library size normalized counts, calculated by DESeq2, that can be used for visualization purposes.
#test example
system("wget 130.192.119.59/public/test.analysis.zip")
unzip("test.analysis.zip")
setwd("test.analysis")
library(docker4seq)
wrapperDeseq2(output.folder=getwd(), group="docker",
experiment.table="_counts.txt", log2fc=1, fdr=0.1,
ref.covar="Cov.1", type="gene", batch=FALSE)
User has to provide experiment table, experiment.table param, i.e. the counts table generated with samples2experiment function, the thresholds for the differential expression analysis, log2fc and fdr params, the reference covariate, ref.covar param, i.e. the covariate that is used as reference for differential expression detection, the type param, which refers to the type of experiment table in use: gene, isoform, mirna, batch parameter that indicates, if it is set to TRUE that the header of the experiment table also contains the extra information for the batch effect (see above).