Note: the most recent version of this tutorial can be found here and a short overview slide show here.
RNA-Sequencing utilizes next-generation sequencing technologies in order to analyze the transcriptome of a biological sample under specific experimental conditions. The downstream results of this method are heavily dependent on the first step of analysis, alignment of the sequencing reads to a reference genome or transcriptome database. In this project, four RNA-Seq alignment tools (Tophat, Rsubread, HISAT, and Kallisto) were compared by running each aligner in an RNA-Seq workflow and assessing their impact on downstream results.
The more conventional alignment-based methods (Tophat, Rsubread, and HISAT) produced similar downstream results, while Kallisto (an alignment-free method for quantifying transcript abundances) showed several discrepancies in the numbers of differentially expressed genes (DEGs) reported. Despite this difference, samples clustered similarly during gene-wise cluster analysis and yielded similar enrichment results for GO terms.
One major difference observed was the time required to run the aligners. Kallisto and HISAT completed alignment in significantly less time than Tophat and Rsubread, which can be an important deciding factor when many samples need to be processed.
RNA-Sequencing is a method used in transcriptome studies. It uses next-generation sequencing, which is becoming more affordable and available, to reveal the quantity of RNA in a sample at a specific moment in time. It can also be used to discover alternatively spliced transcripts, mutations, and changes in gene expression.
The first step of an RNA-Seq experiment involves alignment of reads against a reference genome or a transcriptome database. This is often a time-consuming step that can have very large memory requirements. Choosing alignment software that best suits the user's needs for downstream applications is therefore an important consideration. In this challenge project, four RNA-Seq alignment tools (Tophat, Rsubread, HISAT, and Kallisto) were run on RNA-Seq sample data sets from Arabidopsis thaliana (Howard et al. 2013), and their impact on downstream analyses, such as read counts and differentially expressed genes (DEGs), was evaluated.
The general workflow for an RNA-Seq analysis (and the workflow followed for this experiment) is as follows:
Read quality assessment, filtering and trimming
Map reads against reference genome
Perform read counting for required ranges (gene ranges)
Normalization of read counts
Identification of differentially expressed genes (DEGs)
Gene set enrichment analysis
Clustering of gene expression profiles
This challenge project specifically evaluated different RNA-Seq aligners used in step 2 of the RNA-Seq workflow (Map reads against reference genome). The following aligners were used for this challenge project:
Tophat - first maps reads to the transcriptome, then aligns unmapped reads to the genome
HISAT - uses two different types of indexes to align reads: a whole-genome index to anchor each alignment to the genome, followed by numerous local indexes for very rapid extension of those anchored alignments
Rsubread - uses a seed-and-vote strategy in which many short seeds (called subreads) extracted from each read vote on the genomic mapping position
Kallisto - an “alignment-free” transcript quantification method that pseudoaligns reads to a transcriptome and quantifies them directly
Typically, the user wants to record here the sources and versions of the reference genome sequence along with the corresponding annotations. In the sample data set all data inputs are stored in a data subdirectory and all results will be written to a separate results directory, while the systemPipeRNAseq.Rmd script and the targets file are expected to be located in the parent directory. The R session is expected to run from this parent directory.
To run this workflow, FASTQ files and the reference genome were downloaded according to the instructions provided here. The chosen data set SRP010938 contains 18 paired-end (PE) read sets from Arabidopsis thaliana (Howard et al. 2013).
The following loads one of the available NGS workflow templates (here RNA-Seq) into the user’s current working directory.
library(systemPipeRdata)
genWorkenvir(workflow="rnaseq")
setwd("rnaseq")
download.file("https://raw.githubusercontent.com/tgirke/GEN242/master/vignettes/12_RNAseqWorkflow/systemPipeRNAseq.Rmd", "systemPipeRNAseq.Rmd")
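Before continuing, the generated project layout can be checked from within R. This is an optional sanity check added here for illustration; it assumes the directory names described above and the script downloaded in the previous step.
## Optional sanity check of the generated workflow environment
dir.exists(c("data", "results"))    # input and output directories created by genWorkenvir()
file.exists("systemPipeRNAseq.Rmd") # workflow script downloaded in the previous step
getwd()                             # the R session should run from this parent directory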
Now open the R markdown script systemPipeRNAseq.Rmd in your R IDE (e.g. vim-r or RStudio) and run the workflow as outlined below.
The systemPipeR package needs to be loaded to perform the analysis steps shown in this report (Girke 2014).
library(systemPipeR)
targets file
The targets file defines all FASTQ files and sample comparisons of the analysis workflow.
targetspath <- system.file("extdata", "targetsPE.txt", package="systemPipeR")
targets <- read.delim(targetspath, comment.char = "#")[,1:6]
targets
## FileName1 FileName2 SampleName Factor SampleLong Experiment
## 1 ./data/SRR446027_1.fastq ./data/SRR446027_2.fastq M1A M1 Mock.1h.A 1
## 2 ./data/SRR446028_1.fastq ./data/SRR446028_2.fastq M1B M1 Mock.1h.B 1
## 3 ./data/SRR446029_1.fastq ./data/SRR446029_2.fastq A1A A1 Avr.1h.A 1
## 4 ./data/SRR446030_1.fastq ./data/SRR446030_2.fastq A1B A1 Avr.1h.B 1
## 5 ./data/SRR446031_1.fastq ./data/SRR446031_2.fastq V1A V1 Vir.1h.A 1
## 6 ./data/SRR446032_1.fastq ./data/SRR446032_2.fastq V1B V1 Vir.1h.B 1
## 7 ./data/SRR446033_1.fastq ./data/SRR446033_2.fastq M6A M6 Mock.6h.A 1
## 8 ./data/SRR446034_1.fastq ./data/SRR446034_2.fastq M6B M6 Mock.6h.B 1
## 9 ./data/SRR446035_1.fastq ./data/SRR446035_2.fastq A6A A6 Avr.6h.A 1
## 10 ./data/SRR446036_1.fastq ./data/SRR446036_2.fastq A6B A6 Avr.6h.B 1
## 11 ./data/SRR446037_1.fastq ./data/SRR446037_2.fastq V6A V6 Vir.6h.A 1
## 12 ./data/SRR446038_1.fastq ./data/SRR446038_2.fastq V6B V6 Vir.6h.B 1
## 13 ./data/SRR446039_1.fastq ./data/SRR446039_2.fastq M12A M12 Mock.12h.A 1
## 14 ./data/SRR446040_1.fastq ./data/SRR446040_2.fastq M12B M12 Mock.12h.B 1
## 15 ./data/SRR446041_1.fastq ./data/SRR446041_2.fastq A12A A12 Avr.12h.A 1
## 16 ./data/SRR446042_1.fastq ./data/SRR446042_2.fastq A12B A12 Avr.12h.B 1
## 17 ./data/SRR446043_1.fastq ./data/SRR446043_2.fastq V12A V12 Vir.12h.A 1
## 18 ./data/SRR446044_1.fastq ./data/SRR446044_2.fastq V12B V12 Vir.12h.B 1
The preprocessReads function applies predefined or custom read preprocessing functions, such as quality filtering or adaptor trimming routines, to all FASTQ files referenced in a SYSargs container. The following example performs adaptor trimming with the trimLRPatterns function from the Biostrings package. After the trimming step a new targets file is generated (here targetsPE_trim.txt) containing the paths to the trimmed FASTQ files. The new targets file can be used for the next workflow step with an updated SYSargs instance, e.g. running the NGS alignments with the trimmed FASTQ files.
## Scaling
args <- systemArgs(sysma="param/trimPE.param", mytargets="targetsPE.txt")
preprocessReads(args=args, Fct="trimLRPatterns(Rpattern='GCCCGGGTAA', subject=fq)",
batchsize=100000, overwrite=TRUE, compress=TRUE)
writeTargetsout(x=args, file="targetsPE_trim.txt", overwrite=TRUE)
The following seeFastq and seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files including per cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution. The results are written to a PDF file named fastqReport.pdf.
args <- systemArgs(sysma="param/tophat.param", mytargets="targetsPE.txt")
fqlist <- seeFastq(fastq=infile1(args), batchsize=100000, klength=8)
pdf("./results/fastqReport.pdf", height=18, width=4*length(fqlist))
seeFastqPlot(fqlist)
dev.off()
Code for alignments with each of the four alignment tools evaluated in this study is shown in this section:
Bowtie2/TopHat2
The NGS reads of this project will be aligned against the reference genome sequence using Bowtie2/TopHat2 (Kim et al. 2013; Langmead and Salzberg 2012). The parameter settings of the aligner are defined in the tophat.param file.
args <- systemArgs(sysma="param/tophat.param", mytargets="targetsPE_trim.txt")
sysargs(args)[1] # Command-line parameters for first FASTQ file
Submission of alignment jobs to compute cluster, here using 72 CPU cores (18 qsub processes each with 4 CPU cores).
moduleload(modules(args))
system("bowtie2-build ./data/tair10.fasta ./data/tair10.fasta")
resources <- list(walltime="20:00:00", nodes=paste0("1:ppn=", cores(args)), memory="10gb")
reg <- clusterRun(args, conffile=".BatchJobs.R", template="torque.tmpl", Njobs=18, runid="01",
resourceList=resources)
waitForJobs(reg)
Check whether all BAM files have been created
file.exists(outpaths(args))
HISAT2
The NGS reads of this project will be aligned against the reference genome sequence using HISAT2 (Kim, Langmead, and Salzberg 2015). The parameter settings of the aligner are defined in the hisat2PE.param file.
args <- systemArgs(sysma="param/hisat2PE.param", mytargets="targetsPE_trim.txt") # Paired end
sysargs(args)[1] # Command-line parameters for first FASTQ file
moduleload(modules(args)) # Load HISAT2 from module system
system("hisat2-build ./data/tair10.fasta ./data/tair10.fasta") #indexes reference
Submission of alignment jobs to compute cluster, here using 72 CPU cores (18 qsub processes each with 4 CPU cores).
resources <- list(walltime="20:00:00", nodes=paste0("1:ppn=", cores(args)), memory="20gb")
reg <- clusterRun(args, conffile=".BatchJobs.R", template="torque.tmpl", Njobs=18, runid="01",
resourceList=resources)
waitForJobs(reg)
Check whether all BAM files have been created
file.exists(outpaths(args))
Rsubread
The NGS reads of this project will be aligned against the reference genome sequence using Rsubread (Liao, Smyth, and Shi 2013). The parameter settings of the aligner are defined in the rsubread.param file.
library(Rsubread); library(Rsamtools) # Rsamtools provides the asBam() call used below
args <- systemArgs(sysma="param/rsubread.param", mytargets="targetsPE_trim.txt")
buildindex(basename=reference(args), reference=reference(args))
align(index=reference(args), readfile1=infile1(args), input_format="FASTQ",
output_file=outfile1(args), output_format="SAM", nthreads=8, indels=1, TH1=2)
for(i in seq(along=outfile1(args))) asBam(file=outfile1(args)[i], destination=gsub(".sam", "", outfile1(args)[i]), overwrite=TRUE, indexDestination=TRUE)
Kallisto (run from command line)
cd ~/bigdata/rnaseq/data
~/bin/kallisto_linux-v0.42.5/kallisto index -i Arabidopsis_transcriptome_Kallisto.idx Kallisto_Arabidopsis_thaliana.TAIR10.26.cdna.all.fa.gz
On the command line, use the syntax kallisto quant -i transcripts.idx -o output -b 100 reads_1.fastq.gz reads_2.fastq.gz to align one data set at a time, or run the shell script batch_kallisto.sh to align all data sets together.
cd ~/bigdata/rnaseq/results
~/bin/kallisto_linux-v0.42.5/kallisto quant -i ~/bigdata/rnaseq/data/Arabidopsis_transcriptome_Kallisto.idx -o output -b 100 SRR446027_1.fastq_trim.gz SRR446027_2.fastq_trim.gz
or
cd ~/bigdata/rnaseq/results
bash batch_kallisto.sh
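Since aligner runtime is one of the comparison criteria discussed later in this report, it can help to record the wall-clock time of each alignment run. The following is a minimal sketch (not part of the original workflow) that times the Kallisto batch run from within R; runtimes of cluster-submitted jobs would instead be taken from the scheduler logs.
## Record wall-clock time of the Kallisto batch run (illustrative only)
## assumes the R working directory contains batch_kallisto.sh
kallisto_time <- system.time(system("bash batch_kallisto.sh"))
kallisto_time["elapsed"] # elapsed wall-clock time in seconds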
The following provides an overview of the number of reads in each sample and how many of them aligned to the reference.
read_statsDF <- alignStats(args=args)
write.table(read_statsDF, "results/alignStats.xls", row.names=FALSE, quote=FALSE, sep="\t")
read.table(system.file("extdata", "alignStats.xls", package="systemPipeR"), header=TRUE)[1:4,]
## FileName Nreads2x Nalign Perc_Aligned Nalign_Primary Perc_Aligned_Primary
## 1 M1A 192918 177961 92.24697 177961 92.24697
## 2 M1B 197484 159378 80.70426 159378 80.70426
## 3 A1A 189870 176055 92.72397 176055 92.72397
## 4 A1B 188854 147768 78.24457 147768 78.24457
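For a quick visual comparison of mapping rates across samples (and, when repeated for each aligner, across alignment tools), the table written above can be plotted directly. This is an illustrative sketch assuming results/alignStats.xls was created by the alignStats call above:
## Barplot of per-sample alignment rates from the alignStats table written above
stats <- read.delim("results/alignStats.xls")
pdf("results/alignStats_barplot.pdf", height=6, width=10)
barplot(stats$Perc_Aligned, names.arg=stats$FileName, las=2,
        ylab="Aligned reads (%)", ylim=c(0, 100))
dev.off()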
summarizeOverlaps in parallel mode using multiple cores
Reads overlapping with annotation ranges of interest are counted for each sample using the summarizeOverlaps function (Lawrence et al. 2013). The read counting is performed for exonic gene regions in a non-strand-specific manner while ignoring overlaps among different genes. Subsequently, the expression count values are normalized by reads per kilobase per million mapped reads (RPKM). The raw read count table (countDFeByg.xls) and the corresponding RPKM table (rpkmDFeByg.xls) are written to separate files in the directory of this project. Parallelization is achieved with the BiocParallel package; the number of CPU cores is set with the MulticoreParam call below.
library("GenomicFeatures"); library(BiocParallel)
txdb <- makeTxDbFromGFF(file="data/tair10.gff", format="gff", dataSource="TAIR", organism="Arabidopsis thaliana")
saveDb(txdb, file="./data/tair10.sqlite")
txdb <- loadDb("./data/tair10.sqlite")
(align <- readGAlignments(outpaths(args)[1])) # Demonstrates how to read bam file into R
eByg <- exonsBy(txdb, by=c("gene"))
bfl <- BamFileList(outpaths(args), yieldSize=50000, index=character())
multicoreParam <- MulticoreParam(workers=2); register(multicoreParam); registered()
counteByg <- bplapply(bfl, function(x) summarizeOverlaps(eByg, x, mode="Union",
ignore.strand=TRUE,
inter.feature=FALSE,
singleEnd=TRUE))
countDFeByg <- sapply(seq(along=counteByg), function(x) assays(counteByg[[x]])$counts)
rownames(countDFeByg) <- names(rowRanges(counteByg[[1]])); colnames(countDFeByg) <- names(bfl)
rpkmDFeByg <- apply(countDFeByg, 2, function(x) returnRPKM(counts=x, ranges=eByg))
write.table(countDFeByg, "results/countDFeByg.xls", col.names=NA, quote=FALSE, sep="\t")
write.table(rpkmDFeByg, "results/rpkmDFeByg.xls", col.names=NA, quote=FALSE, sep="\t")
Sample data slice of the count table
read.delim("results/countDFeByg.xls", row.names=1, check.names=FALSE)[1:4,1:5]
Sample data slice of the RPKM table
read.delim("results/rpkmDFeByg.xls", row.names=1, check.names=FALSE)[1:4,1:4]
Note, for most statistical differential expression or abundance analysis methods, such as edgeR or DESeq2, the raw count values should be used as input. The usage of RPKM values should be restricted to specialty applications required by some users, e.g. manually comparing the expression levels among different genes or features.
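To make the normalization explicit, the following sketch shows a manual RPKM calculation that should be equivalent in spirit to the returnRPKM helper used above (this reconstruction is an assumption, not the helper's actual source); it relies on the eByg and countDFeByg objects created in the counting step.
## Manual RPKM: counts / (gene length in kb * library size in millions)
## gene order is assumed to match the rows of countDFeByg (both derive from eByg)
geneLengthKb <- sum(width(reduce(eByg))) / 1e3   # non-overlapping exon length per gene (kb)
libSizeM <- colSums(countDFeByg) / 1e6           # mapped read counts per sample (millions)
rpkmManual <- t(t(countDFeByg / geneLengthKb) / libSizeM)
rpkmManual[1:4, 1:4]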
Note: The above code can be used to analyze data from Tophat, HISAT, and Rsubread aligners. The following is code for performing read counting when using Kallisto as your aligner.
Kallisto performs this step together with pseudoalignment. The abundance estimates produced by Kallisto are written to the abundance.tsv file of each sample. Abundances are reported as estimated counts (est_counts) and as Transcripts Per Million (TPM).
The transcript-level abundance quantification results from Kallisto need to be summarized into gene-level abundances for downstream DEG analysis.
library(tximportData)
library(tximport)
dir <- "/bigdata/gen242/yxu1/rnaseq"
list.files(dir)
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
files <- file.path(dir, "results", samples$run, "abundance.tsv")
names(files) <- paste0("sample", 1:18)
all(file.exists(files))
## The following block illustrates the generic tximport approach for deriving a
## transcript-to-gene mapping from a TxDb object (here the human hg19 TxDb); it is
## shown for reference only and its result is not used downstream.
library(AnnotationDbi)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
tx2gene <- df[, 2:1] # tx ID, then gene ID
head(tx2gene)
## The Arabidopsis thaliana mapping actually used for this data set is read from a CSV file
tx2gene <- read.csv(file.path(dir, "Arabidopsistx2gene.csv"))
head(tx2gene)
library(tximport)
library(readr)
txi <- tximport(files, type = "kallisto", tx2gene = tx2gene, reader = read_tsv)
names(txi)
head(txi$counts)
txi.gene <- tximport(files, type = "kallisto", txOut = FALSE, tx2gene = tx2gene, reader = read_tsv) # txOut=FALSE summarizes to gene level
write.csv(txi.gene$abundance, file = "abundancegenelevel.csv") # gene-level abundances (TPM)
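Downstream, the edgeR step for Kallisto reads a file named abundancecountDF.csv. How that file was generated is not shown above; one plausible reconstruction (an assumption, not the original code) is to round the gene-level estimated counts from tximport, since edgeR expects count-like values:
## Reconstruction (assumption): write rounded gene-level counts for the edgeR step below
countDFkallisto <- round(txi.gene$counts)
write.csv(countDFkallisto, file = "abundancecountDF.csv")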
The following computes the sample-wise Spearman correlation coefficients from the rlog transformed expression values generated with the DESeq2 package. After transformation to a distance matrix, hierarchical clustering is performed with the hclust function and the result is plotted as a dendrogram (also see file sample_tree.pdf).
library(DESeq2, quietly=TRUE); library(ape, warn.conflicts=FALSE)
countDF <- as.matrix(read.table("./results/countDFeByg.xls"))
colData <- data.frame(row.names=targetsin(args)$SampleName, condition=targetsin(args)$Factor)
dds <- DESeqDataSetFromMatrix(countData = countDF, colData = colData, design = ~ condition)
d <- cor(assay(rlog(dds)), method="spearman")
hc <- hclust(dist(1-d))
pdf("results/sample_tree.pdf")
plot.phylo(as.phylo(hc), type="p", edge.col="blue", edge.width=2, show.node.label=TRUE, no.margin=TRUE)
dev.off()
The analysis of differentially expressed genes (DEGs) is performed with the glm method of the edgeR package (Robinson, McCarthy, and Smyth 2010). The sample comparisons used by this analysis are defined in the header lines of the targetsPE.txt file starting with <CMP>.
edgeR
library(edgeR)
countDF <- read.delim("results/countDFeByg.xls", row.names=1, check.names=FALSE)
targets <- read.delim("targetsPE_trim.txt", comment="#")
cmp <- readComp(file="targetsPE_trim.txt", format="matrix", delim="-")
edgeDF <- run_edgeR(countDF=countDF, targets=targets, cmp=cmp[[1]], independent=FALSE, mdsplot="")
Add gene descriptions
library("biomaRt")
m <- useMart("plants_mart", dataset="athaliana_eg_gene", host="plants.ensembl.org")
desc <- getBM(attributes=c("tair_locus", "description"), mart=m)
desc <- desc[!duplicated(desc[,1]),]
descv <- as.character(desc[,2]); names(descv) <- as.character(desc[,1])
edgeDF <- data.frame(edgeDF, Desc=descv[rownames(edgeDF)], check.names=FALSE)
write.table(edgeDF, "./results/edgeRglm_allcomp.xls", quote=FALSE, sep="\t", col.names = NA)
Filter and plot DEG results for up and down regulated genes. The definition of up and down is given in the corresponding help file. To open it, type ?filterDEGs in the R console.
edgeDF <- read.delim("results/edgeRglm_allcomp.xls", row.names=1, check.names=FALSE)
pdf("results/DEGcounts.pdf")
DEG_list <- filterDEGs(degDF=edgeDF, filter=c(Fold=2, FDR=20))
dev.off()
write.table(DEG_list$Summary, "./results/DEGcounts.xls", quote=FALSE, sep="\t", row.names=FALSE)
Note: The above code can be used to analyze data from Tophat, HISAT, and Rsubread aligners. The following is code for performing analysis of differentially expressed genes when using Kallisto as your aligner.
edgeR
countDF <- read.csv(file = "abundancecountDF.csv", row.names=1, check.names=FALSE)
targets <- read.delim("targetsPE_trim.txt", comment="#")
cmp <- readComp(file="targetsPE_trim.txt", format="matrix", delim="-")
edgeDF <- run_edgeR(countDF=countDF, targets=targets, cmp=cmp[[1]], independent=FALSE, mdsplot="")
The overLapper function can compute Venn intersects for large numbers of sample sets (up to 20 or more) and plots 2-5 way Venn diagrams. A useful feature is the possibility to combine the counts from several Venn comparisons with the same number of sample sets in a single Venn diagram (here for 4 up and down DEG sets).
vennsetup <- overLapper(DEG_list$Up[6:9], type="vennsets")
vennsetdown <- overLapper(DEG_list$Down[6:9], type="vennsets")
pdf("results/vennplot.pdf")
vennPlot(list(vennsetup, vennsetdown), mymain="", mysub="", colmode=2, ccol=c("blue", "red"))
dev.off()
The following shows how to obtain gene-to-GO mappings from biomaRt (here for A. thaliana) and how to organize them for the downstream GO term enrichment analysis. Alternatively, the gene-to-GO mappings can be obtained for many organisms from Bioconductor's *.db genome annotation packages or GO annotation files provided by various genome databases. For each annotation this relatively slow preprocessing step needs to be performed only once. Subsequently, the preprocessed data can be loaded with the load function as shown in the next subsection.
library("biomaRt")
listMarts() # To choose BioMart database
listMarts(host="plants.ensembl.org")
m <- useMart("plants_mart", host="plants.ensembl.org")
listDatasets(m)
m <- useMart("plants_mart", dataset="athaliana_eg_gene", host="plants.ensembl.org")
listAttributes(m) # Choose data types you want to download
go <- getBM(attributes=c("go_accession", "tair_locus", "go_namespace_1003"), mart=m)
go <- go[go[,3]!="",]; go[,3] <- as.character(go[,3])
go[go[,3]=="molecular_function", 3] <- "F"; go[go[,3]=="biological_process", 3] <- "P"; go[go[,3]=="cellular_component", 3] <- "C"
go[1:4,]
dir.create("./data/GO")
write.table(go, "data/GO/GOannotationsBiomart_mod.txt", quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")
catdb <- makeCATdb(myfile="data/GO/GOannotationsBiomart_mod.txt", lib=NULL, org="", colno=c(1,2,3), idconv=NULL)
save(catdb, file="data/GO/catdb.RData")
Apply the enrichment analysis to the DEG sets obtained in the above differential expression analysis. Note, in the following example the FDR filter is set to an unreasonably high value, simply because of the small size of the toy data set used in this vignette. Batch enrichment analysis of many gene sets is performed with the GOCluster_Report function. When method=all, it returns all GO terms passing the p-value cutoff specified under the cutoff argument. When method=slim, it returns only the GO terms specified under the myslimv argument. The given example shows how a GO slim vector for a specific organism can be obtained from BioMart.
library("biomaRt")
load("data/GO/catdb.RData")
DEG_list <- filterDEGs(degDF=edgeDF, filter=c(Fold=2, FDR=50), plot=FALSE)
up_down <- DEG_list$UporDown; names(up_down) <- paste(names(up_down), "_up_down", sep="")
up <- DEG_list$Up; names(up) <- paste(names(up), "_up", sep="")
down <- DEG_list$Down; names(down) <- paste(names(down), "_down", sep="")
DEGlist <- c(up_down, up, down)
DEGlist <- DEGlist[sapply(DEGlist, length) > 0]
BatchResult <- GOCluster_Report(catdb=catdb, setlist=DEGlist, method="all", id_type="gene", CLSZ=2, cutoff=0.9, gocats=c("MF", "BP", "CC"), recordSpecGO=NULL)
library("biomaRt")
m <- useMart("plants_mart", dataset="athaliana_eg_gene", host="plants.ensembl.org")
goslimvec <- as.character(getBM(attributes=c("goslim_goa_accession"), mart=m)[,1])
BatchResultslim <- GOCluster_Report(catdb=catdb, setlist=DEGlist, method="slim", id_type="gene", myslimv=goslimvec, CLSZ=10, cutoff=0.01, gocats=c("MF", "BP", "CC"), recordSpecGO=NULL)
The data.frame generated by GOCluster_Report can be plotted with the goBarplot function. Because of the variable size of the sample sets, it may not always be desirable to show the results from different DEG sets in the same bar plot. Plotting single sample sets is achieved by subsetting the input data frame as shown in the first line of the following example.
gos <- BatchResultslim[grep("M6-V6_up_down", BatchResultslim$CLID), ]
gos <- BatchResultslim
pdf("./results/GOslimbarplotMF.pdf", height=8, width=10); goBarplot(gos, gocat="MF"); dev.off()
The following example performs hierarchical clustering on the rlog transformed expression matrix subsetted by the DEGs identified in the above differential expression analysis. It uses a Pearson correlation-based distance measure and complete linkage for cluster joining.
library(pheatmap)
geneids <- unique(as.character(unlist(DEG_list[[1]])))
y <- assay(rlog(dds))[geneids, ]
pdf("./results/heatmap1.pdf")
pheatmap(y, scale="row", clustering_distance_rows="correlation", clustering_distance_cols="correlation")
dev.off()
In this challenge project, four different alignment tools for mapping next-generation RNA-Sequencing reads to a reference genome or transcriptome were compared. They were used in an RNA-Seq workflow and evaluated for their impact on downstream analyses, such as read counts and differentially expressed genes.
The percentage of aligned reads was very similar among the alignment tools tested. Some variability in the percentages was observed, but this was possibly due to technical variability between biological replicates rather than to the alignment tools themselves.
Some noticeable differences were observed in the number of DEGs identified with edgeR, particularly when comparing Kallisto to the more conventional aligners. The larger number of DEGs obtained with this aligner may be attributed to Kallisto being more sensitive, owing to its direct quantification of reads against the transcriptome.
Sample-wise correlation analysis showed that hierarchical clustering could give somewhat different results with different aligners, but on the whole most samples clustered in the same way. A similar pattern was seen when gene expression values were clustered in a heatmap.
Overall, the alignment tools performed analogously and did not show a significant impact on downstream analyses. One of the most noticeable differences was the time that each aligner took to run. Kallisto and HISAT ran significantly faster than Tophat and Rsubread, which may make them the optimal choice for users who have many samples to analyze.
Girke, Thomas. 2014. “systemPipeR: NGS Workflow and Report Generation Environment.” UC Riverside. https://github.com/tgirke/systemPipeR.
Howard, Brian E, Qiwen Hu, Ahmet Can Babaoglu, Manan Chandra, Monica Borghi, Xiaoping Tan, Luyan He, et al. 2013. “High-Throughput RNA Sequencing of Pseudomonas-Infected Arabidopsis Reveals Hidden Transcriptome Complexity and Novel Splice Variants.” PLoS One 8 (10): e74183. doi:10.1371/journal.pone.0074183.
Kim, Daehwan, Ben Langmead, and Steven L Salzberg. 2015. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nat. Methods 12 (4): 357–60.
Kim, Daehwan, Geo Pertea, Cole Trapnell, Harold Pimentel, Ryan Kelley, and Steven L Salzberg. 2013. “TopHat2: Accurate Alignment of Transcriptomes in the Presence of Insertions, Deletions and Gene Fusions.” Genome Biol. 14 (4): R36. doi:10.1186/gb-2013-14-4-r36.
Langmead, Ben, and Steven L Salzberg. 2012. “Fast Gapped-Read Alignment with Bowtie 2.” Nat. Methods 9 (4). Nature Publishing Group: 357–59. doi:10.1038/nmeth.1923.
Lawrence, Michael, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T Morgan, and Vincent J Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Comput. Biol. 9 (8): e1003118. doi:10.1371/journal.pcbi.1003118.
Liao, Yang, Gordon K Smyth, and Wei Shi. 2013. “The Subread Aligner: Fast, Accurate and Scalable Read Mapping by Seed-and-Vote.” Nucleic Acids Res. 41 (10): e108. doi:10.1093/nar/gkt214.
Robinson, M D, D J McCarthy, and G K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. doi:10.1093/bioinformatics/btp616.