Note: the most recent version of this tutorial can be found here and a short overview slide show here.
RNA-Sequencing utilizes next-generation sequencing technologies in order to analyze the transcriptome of a biological sample under specific experimental conditions. The downstream results of this method are heavily dependent on the first step of analysis, alignment of the sequencing reads to a reference genome or transcriptome database. In this project, four RNA-Seq alignment tools (Tophat, Rsubread, HISAT, and Kallisto) were compared by running each aligner in an RNA-Seq workflow and assessing their impact on downstream results.
The more conventional alignment-based methods (Tophat, Rsubread, and HISAT) produced similar downstream results, while Kallisto (an alignment-free method for quantifying transcript abundances) showed several discrepancies in the numbers of differentially expressed genes (DEGs) reported. Despite this difference, samples clustered similarly during gene-wise cluster analysis and yielded similar enrichment results for GO terms.
One major difference observed was the time required to run the aligners. Kallisto and HISAT completed alignment in significantly less time than Tophat and Rsubread, which can be an important deciding factor when many samples need to be processed.
RNA-Sequencing is a method used in transcriptome studies. It uses next-generation sequencing, which is becoming more affordable and available, to reveal the quantity of RNA in a sample at a specific moment in time. It can also be used to discover alternatively spliced transcripts, mutations, and changes in gene expression.
The first step of an RNA-Seq experiment involves alignment of reads against a reference genome or a transcriptome database. This is often a time-consuming step that can have very large memory requirements. Choosing alignment software that best suits the user's needs for downstream applications is therefore an important consideration. In this challenge project, four RNA-Seq alignment tools (Tophat, Rsubread, HISAT, and Kallisto) were run on RNA-Seq sample data sets from Arabidopsis thaliana (Howard et al. 2013), and their impact on downstream analyses, such as read counts and differentially expressed genes (DEGs), was evaluated.
The general workflow for an RNA-Seq analysis (and the workflow followed for this experiment) is as follows:
Read quality assessment, filtering and trimming
Map reads against reference genome
Perform read counting for required ranges (gene ranges)
Normalization of read counts
Identification of differentially expressed genes (DEGs)
Gene set enrichment analysis
Clustering of gene expression profiles
This challenge project specifically evaluated different RNA-Seq aligners used in step 2 of the RNA-Seq workflow (Map reads against reference genome). The following aligners were used for this challenge project:
Tophat - first maps reads to the transcriptome, then aligns unmapped reads to the genome
HISAT - uses two different types of indexes to align reads: a whole-genome index to anchor each alignment to the genome, followed by numerous local indexes for very rapid extension of those anchored alignments
Rsubread - uses a seed-and-vote strategy in which many short seeds (called subreads) extracted from each read vote on the genomic mapping position
Kallisto - an “alignment-free” transcript quantification method that pseudoaligns reads to a transcriptome and quantifies them directly
Typically, the user wants to record here the sources and versions of the reference genome sequence along with the corresponding annotations. In the sample data set all data inputs are stored in a data subdirectory and all results will be written to a separate results directory, while the systemPipeRNAseq.Rmd script and the targets file are expected to be located in the parent directory. The R session is expected to run from this parent directory.
To run this workflow, FASTQ files and the reference genome were downloaded according to the instructions provided here. The chosen data set SRP010938 contains 18 paired-end (PE) read sets from Arabidopsis thaliana (Howard et al. 2013).
The following loads one of the available NGS workflow templates (here RNA-Seq) into the user’s current working directory.
library(systemPipeRdata)
genWorkenvir(workflow="rnaseq")
setwd("rnaseq")
download.file("https://raw.githubusercontent.com/tgirke/GEN242/master/vignettes/12_RNAseqWorkflow/systemPipeRNAseq.Rmd", "systemPipeRNAseq.Rmd")
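Before continuing, the generated project layout can be checked from within R. This is an optional sanity check added here for illustration; it assumes the directory names described above and the script downloaded in the previous step.
## Optional sanity check of the generated workflow environment
dir.exists(c("data", "results"))    # input and output directories created by genWorkenvir()
file.exists("systemPipeRNAseq.Rmd") # workflow script downloaded in the previous step
getwd()                             # the R session should run from this parent directory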
Now open the R markdown script systemPipeRNAseq.Rmd in your R IDE (e.g. vim-r or RStudio) and run the workflow as outlined below.
The systemPipeR package needs to be loaded to perform the analysis steps shown in this report (Girke 2014).
library(systemPipeR)
targets file
The targets file defines all FASTQ files and sample comparisons of the analysis workflow.
targetspath <- system.file("extdata", "targetsPE.txt", package="systemPipeR")
targets <- read.delim(targetspath, comment.char = "#")[,1:6]
targets
## FileName1 FileName2 SampleName Factor SampleLong Experiment
## 1 ./data/SRR446027_1.fastq ./data/SRR446027_2.fastq M1A M1 Mock.1h.A 1
## 2 ./data/SRR446028_1.fastq ./data/SRR446028_2.fastq M1B M1 Mock.1h.B 1
## 3 ./data/SRR446029_1.fastq ./data/SRR446029_2.fastq A1A A1 Avr.1h.A 1
## 4 ./data/SRR446030_1.fastq ./data/SRR446030_2.fastq A1B A1 Avr.1h.B 1
## 5 ./data/SRR446031_1.fastq ./data/SRR446031_2.fastq V1A V1 Vir.1h.A 1
## 6 ./data/SRR446032_1.fastq ./data/SRR446032_2.fastq V1B V1 Vir.1h.B 1
## 7 ./data/SRR446033_1.fastq ./data/SRR446033_2.fastq M6A M6 Mock.6h.A 1
## 8 ./data/SRR446034_1.fastq ./data/SRR446034_2.fastq M6B M6 Mock.6h.B 1
## 9 ./data/SRR446035_1.fastq ./data/SRR446035_2.fastq A6A A6 Avr.6h.A 1
## 10 ./data/SRR446036_1.fastq ./data/SRR446036_2.fastq A6B A6 Avr.6h.B 1
## 11 ./data/SRR446037_1.fastq ./data/SRR446037_2.fastq V6A V6 Vir.6h.A 1
## 12 ./data/SRR446038_1.fastq ./data/SRR446038_2.fastq V6B V6 Vir.6h.B 1
## 13 ./data/SRR446039_1.fastq ./data/SRR446039_2.fastq M12A M12 Mock.12h.A 1
## 14 ./data/SRR446040_1.fastq ./data/SRR446040_2.fastq M12B M12 Mock.12h.B 1
## 15 ./data/SRR446041_1.fastq ./data/SRR446041_2.fastq A12A A12 Avr.12h.A 1
## 16 ./data/SRR446042_1.fastq ./data/SRR446042_2.fastq A12B A12 Avr.12h.B 1
## 17 ./data/SRR446043_1.fastq ./data/SRR446043_2.fastq V12A V12 Vir.12h.A 1
## 18 ./data/SRR446044_1.fastq ./data/SRR446044_2.fastq V12B V12 Vir.12h.B 1
The preprocessReads function applies predefined or custom read preprocessing functions, such as quality filtering or adaptor trimming routines, to all FASTQ files referenced in a SYSargs container. The following example performs adaptor trimming with the trimLRPatterns function from the Biostrings package. After the trimming step a new targets file is generated (here targetsPE_trim.txt) containing the paths to the trimmed FASTQ files. The new targets file can be used for the next workflow step with an updated SYSargs instance, e.g. running the NGS alignments with the trimmed FASTQ files.
## Scaling
args <- systemArgs(sysma="param/trimPE.param", mytargets="targetsPE.txt")
preprocessReads(args=args, Fct="trimLRPatterns(Rpattern='GCCCGGGTAA', subject=fq)",
batchsize=100000, overwrite=TRUE, compress=TRUE)
writeTargetsout(x=args, file="targetsPE_trim.txt", overwrite=TRUE)
The following seeFastq and seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files including per cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution. The results are written to a PDF file named fastqReport.pdf.
args <- systemArgs(sysma="param/tophat.param", mytargets="targetsPE.txt")
fqlist <- seeFastq(fastq=infile1(args), batchsize=100000, klength=8)
pdf("./results/fastqReport.pdf", height=18, width=4*length(fqlist))
seeFastqPlot(fqlist)
dev.off()
Code for alignments with each of the four alignment tools evaluated in this study is shown in this section:
Bowtie2/TopHat2
The NGS reads of this project will be aligned against the reference genome sequence using Bowtie2/TopHat2 (Kim et al. 2013; Langmead and Salzberg 2012). The parameter settings of the aligner are defined in the tophat.param file.
args <- systemArgs(sysma="param/tophat.param", mytargets="targetsPE_trim.txt")
sysargs(args)[1] # Command-line parameters for first FASTQ file
Submission of alignment jobs to compute cluster, here using 72 CPU cores (18 qsub processes each with 4 CPU cores).
moduleload(modules(args))
system("bowtie2-build ./data/tair10.fasta ./data/tair10.fasta")
resources <- list(walltime="20:00:00", nodes=paste0("1:ppn=", cores(args)), memory="10gb")
reg <- clusterRun(args, conffile=".BatchJobs.R", template="torque.tmpl", Njobs=18, runid="01",
resourceList=resources)
waitForJobs(reg)
Check whether all BAM files have been created
file.exists(outpaths(args))
HISAT2
The NGS reads of this project will be aligned against the reference genome sequence using HISAT2 (Kim, Langmead, and Salzberg 2015). The parameter settings of the aligner are defined in the hisat2PE.param file.
args <- systemArgs(sysma="param/hisat2PE.param", mytargets="targetsPE_trim.txt") # Paired end
sysargs(args)[1] # Command-line parameters for first FASTQ file
moduleload(modules(args)) # Load HISAT2 from module system
system("hisat2-build ./data/tair10.fasta ./data/tair10.fasta") #indexes reference
Submission of alignment jobs to compute cluster, here using 72 CPU cores (18 qsub processes each with 4 CPU cores).
resources <- list(walltime="20:00:00", nodes=paste0("1:ppn=", cores(args)), memory="20gb")
reg <- clusterRun(args, conffile=".BatchJobs.R", template="torque.tmpl", Njobs=18, runid="01",
resourceList=resources)
waitForJobs(reg)
Check whether all BAM files have been created
file.exists(outpaths(args))
Rsubread
The NGS reads of this project will be aligned against the reference genome sequence using Rsubread (Liao, Smyth, and Shi 2013). The parameter settings of the aligner are defined in the rsubread.param file.
library(Rsubread); library(Rsamtools) # Rsamtools provides the asBam() call used below
args <- systemArgs(sysma="param/rsubread.param", mytargets="targetsPE_trim.txt")
buildindex(basename=reference(args), reference=reference(args))
align(index=reference(args), readfile1=infile1(args), input_format="FASTQ",
output_file=outfile1(args), output_format="SAM", nthreads=8, indels=1, TH1=2)
for(i in seq(along=outfile1(args))) asBam(file=outfile1(args)[i], destination=gsub(".sam", "", outfile1(args)[i]), overwrite=TRUE, indexDestination=TRUE)
Kallisto (run from command line)
cd ~/bigdata/rnaseq/data
~/bin/kallisto_linux-v0.42.5/kallisto index -i Arabidopsis_transcriptome_Kallisto.idx Kallisto_Arabidopsis_thaliana.TAIR10.26.cdna.all.fa.gz
On the command line, use the syntax kallisto quant -i transcripts.idx -o output -b 100 reads_1.fastq.gz reads_2.fastq.gz to align one data set at a time, or run the shell script batch_kallisto.sh to align all data sets together.
cd ~/bigdata/rnaseq/results
~/bin/kallisto_linux-v0.42.5/kallisto quant -i ~/bigdata/rnaseq/data/Arabidopsis_transcriptome_Kallisto.idx -o output -b 100 SRR446027_1.fastq_trim.gz SRR446027_2.fastq_trim.gz
or
cd ~/bigdata/rnaseq/results
bash batch_kallisto.sh
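Since aligner runtime is one of the comparison criteria discussed later in this report, it can help to record the wall-clock time of each alignment run. The following is a minimal sketch (not part of the original workflow) that times the Kallisto batch run from within R; runtimes of cluster-submitted jobs would instead be taken from the scheduler logs.
## Record wall-clock time of the Kallisto batch run (illustrative only)
## assumes the R working directory contains batch_kallisto.sh
kallisto_time <- system.time(system("bash batch_kallisto.sh"))
kallisto_time["elapsed"] # elapsed wall-clock time in seconds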
The following provides an overview of the number of reads in each sample and how many of them aligned to the reference.
read_statsDF <- alignStats(args=args)
write.table(read_statsDF, "results/alignStats.xls", row.names=FALSE, quote=FALSE, sep="\t")
read.table(system.file("extdata", "alignStats.xls", package="systemPipeR"), header=TRUE)[1:4,]
## FileName Nreads2x Nalign Perc_Aligned Nalign_Primary Perc_Aligned_Primary
## 1 M1A 192918 177961 92.24697 177961 92.24697
## 2 M1B 197484 159378 80.70426 159378 80.70426
## 3 A1A 189870 176055 92.72397 176055 92.72397
## 4 A1B 188854 147768 78.24457 147768 78.24457
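For a quick visual comparison of mapping rates across samples (and, when repeated for each aligner, across alignment tools), the table written above can be plotted directly. This is an illustrative sketch assuming results/alignStats.xls was created by the alignStats call above:
## Barplot of per-sample alignment rates from the alignStats table written above
stats <- read.delim("results/alignStats.xls")
pdf("results/alignStats_barplot.pdf", height=6, width=10)
barplot(stats$Perc_Aligned, names.arg=stats$FileName, las=2,
        ylab="Aligned reads (%)", ylim=c(0, 100))
dev.off()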
summarizeOverlaps in parallel mode using multiple cores
Reads overlapping with annotation ranges of interest are counted for each sample using the summarizeOverlaps function (Lawrence et al. 2013). The read counting is performed for exonic gene regions in a non-strand-specific manner while ignoring overlaps among different genes. Subsequently, the expression count values are normalized by reads per kilobase per million mapped reads (RPKM). The raw read count table (countDFeByg.xls) and the corresponding RPKM table (rpkmDFeByg.xls) are written to separate files in the directory of this project. Parallelization is achieved with the BiocParallel package; the number of CPU cores is set with the MulticoreParam call below.
library("GenomicFeatures"); library(BiocParallel)
txdb <- makeTxDbFromGFF(file="data/tair10.gff", format="gff", dataSource="TAIR", organism="Arabidopsis thaliana")
saveDb(txdb, file="./data/tair10.sqlite")
txdb <- loadDb("./data/tair10.sqlite")
(align <- readGAlignments(outpaths(args)[1])) # Demonstrates how to read bam file into R
eByg <- exonsBy(txdb, by=c("gene"))
bfl <- BamFileList(outpaths(args), yieldSize=50000, index=character())
multicoreParam <- MulticoreParam(workers=2); register(multicoreParam); registered()
counteByg <- bplapply(bfl, function(x) summarizeOverlaps(eByg, x, mode="Union",
ignore.strand=TRUE,
inter.feature=FALSE,
singleEnd=TRUE))
countDFeByg <- sapply(seq(along=counteByg), function(x) assays(counteByg[[x]])$counts)
rownames(countDFeByg) <- names(rowRanges(counteByg[[1]])); colnames(countDFeByg) <- names(bfl)
rpkmDFeByg <- apply(countDFeByg, 2, function(x) returnRPKM(counts=x, ranges=eByg))
write.table(countDFeByg, "results/countDFeByg.xls", col.names=NA, quote=FALSE, sep="\t")
write.table(rpkmDFeByg, "results/rpkmDFeByg.xls", col.names=NA, quote=FALSE, sep="\t")
Sample data slice of the count table
read.delim("results/countDFeByg.xls", row.names=1, check.names=FALSE)[1:4,1:5]
Sample data slice of the RPKM table
read.delim("results/rpkmDFeByg.xls", row.names=1, check.names=FALSE)[1:4,1:4]
Note, for most statistical differential expression or abundance analysis methods, such as edgeR or DESeq2, the raw count values should be used as input. The usage of RPKM values should be restricted to specialty applications required by some users, e.g. manually comparing the expression levels among different genes or features.
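To make the normalization explicit, the following sketch shows a manual RPKM calculation that should be equivalent in spirit to the returnRPKM helper used above (this reconstruction is an assumption, not the helper's actual source); it relies on the eByg and countDFeByg objects created in the counting step.
## Manual RPKM: counts / (gene length in kb * library size in millions)
## gene order is assumed to match the rows of countDFeByg (both derive from eByg)
geneLengthKb <- sum(width(reduce(eByg))) / 1e3   # non-overlapping exon length per gene (kb)
libSizeM <- colSums(countDFeByg) / 1e6           # mapped read counts per sample (millions)
rpkmManual <- t(t(countDFeByg / geneLengthKb) / libSizeM)
rpkmManual[1:4, 1:4]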
Note: The above code can be used to analyze data from Tophat, HISAT, and Rsubread aligners. The following is code for performing read counting when using Kallisto as your aligner.
Kallisto performs this step together with pseudoalignment. The abundance estimates produced by Kallisto are written to the abundance.tsv file of each sample. Abundances are reported as estimated counts (est_counts) and as Transcripts Per Million (TPM).
The transcript-level abundance quantification results from Kallisto need to be summarized into gene-level abundances for downstream DEG analysis.
library(tximportData)
library(tximport)
dir <- "/bigdata/gen242/yxu1/rnaseq"
list.files(dir)
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
files <- file.path(dir, "results", samples$run, "abundance.tsv")
names(files) <- paste0("sample", 1:18)
all(file.exists(files))
## The following block illustrates the generic tximport approach for deriving a
## transcript-to-gene mapping from a TxDb object (here the human hg19 TxDb); it is
## shown for reference only and its result is not used downstream.
library(AnnotationDbi)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
tx2gene <- df[, 2:1] # tx ID, then gene ID
head(tx2gene)
## The Arabidopsis thaliana mapping actually used for this data set is read from a CSV file
tx2gene <- read.csv(file.path(dir, "Arabidopsistx2gene.csv"))
head(tx2gene)
library(tximport)
library(readr)
txi <- tximport(files, type = "kallisto", tx2gene = tx2gene, reader = read_tsv)
names(txi)
head(txi$counts)
txi.gene <- tximport(files, type = "kallisto", txOut = FALSE, tx2gene = tx2gene, reader = read_tsv) # txOut=FALSE summarizes to gene level
write.csv(txi.gene$abundance, file = "abundancegenelevel.csv") # gene-level abundances (TPM)
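Downstream, the edgeR step for Kallisto reads a file named abundancecountDF.csv. How that file was generated is not shown above; one plausible reconstruction (an assumption, not the original code) is to round the gene-level estimated counts from tximport, since edgeR expects count-like values:
## Reconstruction (assumption): write rounded gene-level counts for the edgeR step below
countDFkallisto <- round(txi.gene$counts)
write.csv(countDFkallisto, file = "abundancecountDF.csv")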
The following computes the sample-wise Spearman correlation coefficients from the rlog transformed expression values generated with the DESeq2 package. After transformation to a distance matrix, hierarchical clustering is performed with the hclust function and the result is plotted as a dendrogram (also see file sample_tree.pdf).
library(DESeq2, quietly=TRUE); library(ape, warn.conflicts=FALSE)
countDF <- as.matrix(read.table("./results/countDFeByg.xls"))
colData <- data.frame(row.names=targetsin(args)$SampleName, condition=targetsin(args)$Factor)
dds <- DESeqDataSetFromMatrix(countData = countDF, colData = colData, design = ~ condition)
d <- cor(assay(rlog(dds)), method="spearman")
hc <- hclust(dist(1-d))
pdf("results/sample_tree.pdf")
plot.phylo(as.phylo(hc), type="p", edge.col="blue", edge.width=2, show.node.label=TRUE, no.margin=TRUE)
dev.off()
The analysis of differentially expressed genes (DEGs) is performed with the glm method of the edgeR package (Robinson, McCarthy, and Smyth 2010). The sample comparisons used by this analysis are defined in the header lines of the targetsPE.txt file starting with <CMP>.
edgeR
library(edgeR)
countDF <- read.delim("results/countDFeByg.xls", row.names=1, check.names=FALSE)
targets <- read.delim("targetsPE_trim.txt", comment="#")
cmp <- readComp(file="targetsPE_trim.txt", format="matrix", delim="-")
edgeDF <- run_edgeR(countDF=countDF, targets=targets, cmp=cmp[[1]], independent=FALSE, mdsplot="")
Add gene descriptions
library("biomaRt")
m <- useMart("plants_mart", dataset="athaliana_eg_gene", host="plants.ensembl.org")
desc <- getBM(attributes=c("tair_locus", "description"), mart=m)
desc <- desc[!duplicated(desc[,1]),]
descv <- as.character(desc[,2]); names(descv) <- as.character(desc[,1])
edgeDF <- data.frame(edgeDF, Desc=descv[rownames(edgeDF)], check.names=FALSE)
write.table(edgeDF, "./results/edgeRglm_allcomp.xls", quote=FALSE, sep="\t", col.names = NA)
Filter and plot DEG results for up and down regulated genes. The definition of up and down is given in the corresponding help file. To open it, type ?filterDEGs in the R console.
edgeDF <- read.delim("results/edgeRglm_allcomp.xls", row.names=1, check.names=FALSE)
pdf("results/DEGcounts.pdf")
DEG_list <- filterDEGs(degDF=edgeDF, filter=c(Fold=2, FDR=20))
dev.off()
write.table(DEG_list$Summary, "./results/DEGcounts.xls", quote=FALSE, sep="\t", row.names=FALSE)
Note: The above code can be used to analyze data from Tophat, HISAT, and Rsubread aligners. The following is code for performing analysis of differentially expressed genes when using Kallisto as your aligner.
edgeR
countDF <- read.csv(file = "abundancecountDF.csv", row.names=1, check.names=FALSE)
targets <- read.delim("targetsPE_trim.txt", comment="#")
cmp <- readComp(file="targetsPE_trim.txt", format="matrix", delim="-")
edgeDF <- run_edgeR(countDF=countDF, targets=targets, cmp=cmp[[1]], independent=FALSE, mdsplot="")
The overLapper function can compute Venn intersects for large numbers of sample sets (up to 20 or more) and plots 2-5 way Venn diagrams. A useful feature is the possibility to combine the counts from several Venn comparisons with the same number of sample sets in a single Venn diagram (here for 4 up and down DEG sets).
vennsetup <- overLapper(DEG_list$Up[6:9], type="vennsets")
vennsetdown <- overLapper(DEG_list$Down[6:9], type="vennsets")
pdf("results/vennplot.pdf")
vennPlot(list(vennsetup, vennsetdown), mymain="", mysub="", colmode=2, ccol=c("blue", "red"))
dev.off()
The following shows how to obtain gene-to-GO mappings from biomaRt (here for A. thaliana) and how to organize them for the downstream GO term enrichment analysis. Alternatively, the gene-to-GO mappings can be obtained for many organisms from Bioconductor's *.db genome annotation packages or GO annotation files provided by various genome databases. For each annotation this relatively slow preprocessing step needs to be performed only once. Subsequently, the preprocessed data can be loaded with the load function as shown in the next subsection.
library("biomaRt")
listMarts() # To choose BioMart database
listMarts(host="plants.ensembl.org")
m <- useMart("plants_mart", host="plants.ensembl.org")
listDatasets(m)
m <- useMart("plants_mart", dataset="athaliana_eg_gene", host="plants.ensembl.org")
listAttributes(m) # Choose data types you want to download
go <- getBM(attributes=c("go_accession", "tair_locus", "go_namespace_1003"), mart=m)
go <- go[go[,3]!="",]; go[,3] <- as.character(go[,3])
go[go[,3]=="molecular_function", 3] <- "F"; go[go[,3]=="biological_process", 3] <- "P"; go[go[,3]=="cellular_component", 3] <- "C"
go[1:4,]
dir.create("./data/GO")
write.table(go, "data/GO/GOannotationsBiomart_mod.txt", quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")
catdb <- makeCATdb(myfile="data/GO/GOannotationsBiomart_mod.txt", lib=NULL, org="", colno=c(1,2,3), idconv=NULL)
save(catdb, file="data/GO/catdb.RData")
Apply the enrichment analysis to the DEG sets obtained in the above differential expression analysis. Note, in the following example the FDR filter is set to an unreasonably high value, simply because of the small size of the toy data set used in this vignette. Batch enrichment analysis of many gene sets is performed with the GOCluster_Report function. When method=all, it returns all GO terms passing the p-value cutoff specified under the cutoff argument. When method=slim, it returns only the GO terms specified under the myslimv argument. The given example shows how a GO slim vector for a specific organism can be obtained from BioMart.
library("biomaRt")
load("data/GO/catdb.RData")
DEG_list <- filterDEGs(degDF=edgeDF, filter=c(Fold=2, FDR=50), plot=FALSE)
up_down <- DEG_list$UporDown; names(up_down) <- paste(names(up_down), "_up_down", sep="")
up <- DEG_list$Up; names(up) <- paste(names(up), "_up", sep="")
down <- DEG_list$Down; names(down) <- paste(names(down), "_down", sep="")
DEGlist <- c(up_down, up, down)
DEGlist <- DEGlist[sapply(DEGlist, length) > 0]
BatchResult <- GOCluster_Report(catdb=catdb, setlist=DEGlist, method="all", id_type="gene", CLSZ=2, cutoff=0.9, gocats=c("MF", "BP", "CC"), recordSpecGO=NULL)
library("biomaRt")
m <- useMart("plants_mart", dataset="athaliana_eg_gene", host="plants.ensembl.org")
goslimvec <- as.character(getBM(attributes=c("goslim_goa_accession"), mart=m)[,1])
BatchResultslim <- GOCluster_Report(catdb=catdb, setlist=DEGlist, method="slim", id_type="gene", myslimv=goslimvec, CLSZ=10, cutoff=0.01, gocats=c("MF", "BP", "CC"), recordSpecGO=NULL)
The data.frame generated by GOCluster_Report can be plotted with the goBarplot function. Because of the variable size of the sample sets, it may not always be desirable to show the results from different DEG sets in the same bar plot. Plotting single sample sets is achieved by subsetting the input data frame as shown in the first line of the following example.
gos <- BatchResultslim[grep("M6-V6_up_down", BatchResultslim$CLID), ]
gos <- BatchResultslim
pdf("./results/GOslimbarplotMF.pdf", height=8, width=10); goBarplot(gos, gocat="MF"); dev.off()
The following example performs hierarchical clustering on the rlog transformed expression matrix subsetted by the DEGs identified in the above differential expression analysis. It uses a Pearson correlation-based distance measure and complete linkage for cluster joining.
library(pheatmap)
geneids <- unique(as.character(unlist(DEG_list[[1]])))
y <- assay(rlog(dds))[geneids, ]
pdf("./results/heatmap1.pdf")
pheatmap(y, scale="row", clustering_distance_rows="correlation", clustering_distance_cols="correlation")
dev.off()
In this challenge project, four different alignment tools for mapping next-generation RNA-Sequencing reads to a reference genome or transcriptome were compared. They were used in an RNA-Seq workflow and evaluated for their impact on downstream analyses, such as read counts and differentially expressed genes.
The percentage of aligned reads was very similar among the alignment tools tested. Some variability in the percentages was observed, but this was possibly due to technical variability between biological replicates rather than to the alignment tools themselves.
Some noticeable differences were observed in the number of DEGs identified with edgeR, particularly when comparing Kallisto to the more conventional aligners. The larger number of DEGs obtained with this aligner may be attributed to Kallisto being more sensitive, owing to its direct quantification of reads against the transcriptome.
Sample-wise correlation analysis showed that hierarchical clustering could give somewhat different results with different aligners, but on the whole most samples clustered in the same way. A similar pattern was seen when gene expression values were clustered in a heatmap.
Overall, the alignment tools performed analogously and did not show a significant impact on downstream analyses. One of the most noticeable differences was the time that each aligner took to run. Kallisto and HISAT ran significantly faster than Tophat and Rsubread, which may make them the optimal choice for users who have many samples to analyze.
Girke, Thomas. 2014. “systemPipeR: NGS Workflow and Report Generation Environment.” UC Riverside. https://github.com/tgirke/systemPipeR.
Howard, Brian E, Qiwen Hu, Ahmet Can Babaoglu, Manan Chandra, Monica Borghi, Xiaoping Tan, Luyan He, et al. 2013. “High-Throughput RNA Sequencing of Pseudomonas-Infected Arabidopsis Reveals Hidden Transcriptome Complexity and Novel Splice Variants.” PLoS One 8 (10): e74183. doi:10.1371/journal.pone.0074183.
Kim, Daehwan, Ben Langmead, and Steven L Salzberg. 2015. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nat. Methods 12 (4): 357–60.
Kim, Daehwan, Geo Pertea, Cole Trapnell, Harold Pimentel, Ryan Kelley, and Steven L Salzberg. 2013. “TopHat2: Accurate Alignment of Transcriptomes in the Presence of Insertions, Deletions and Gene Fusions.” Genome Biol. 14 (4): R36. doi:10.1186/gb-2013-14-4-r36.
Langmead, Ben, and Steven L Salzberg. 2012. “Fast Gapped-Read Alignment with Bowtie 2.” Nat. Methods 9 (4). Nature Publishing Group: 357–59. doi:10.1038/nmeth.1923.
Lawrence, Michael, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T Morgan, and Vincent J Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Comput. Biol. 9 (8): e1003118. doi:10.1371/journal.pcbi.1003118.
Liao, Yang, Gordon K Smyth, and Wei Shi. 2013. “The Subread Aligner: Fast, Accurate and Scalable Read Mapping by Seed-and-Vote.” Nucleic Acids Res. 41 (10): e108. doi:10.1093/nar/gkt214.
Robinson, M D, D J McCarthy, and G K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. doi:10.1093/bioinformatics/btp616.