Introduction

In this little workflow, we will be using a relatively new technology, pseudoalignment and quantification to deal with RNA-seq data from eight samples. The technical steps are:

Use the SRA SDK to download FASTQ files for each sample
Build a transcriptome index for Kallisto
Pseudoalignment and quantification with Kallisto
Read Kallisto output into a SummarizedExperiment object

Technical skills being showcased include:

Accessing data from SRA using R/SRA SDK
Use of R system() functionality to orchestrate workflows involving command-line programs
Parallel processing with BiocParallel
RNA-seq processing with Kallisto

Getting FASTQ Files

We can use the system() R function to call shell functions and scripts. In this case, we are going to use the fastq-dump utility to access SRA accessions. You will be using these data to do the RNA-seq differential expression lab later, though the reads for that lab have been processed using a more traditional protocol.

system('fastq-dump --help')

Based on the output of the fastq-dump help, we can write a small function that will take an SRA accession and dump out paired-end fastq files that are pre-gzipped. To keep things quick, I specify a default of 1e6 reads, but you can change to much larger if you have a lot of time on your hands. With 1e6 reads, each file will be about 90MB in size.

fastqDump = function(sraAccession,maxReads = 1000000) {
  system(sprintf('fastq-dump --split-files --gzip --maxSpotId %d %s',maxReads,sraAccession))
}

We can use the fact that we have multiple cores to process data in parallel.

library(BiocParallel)

How many cores does my machine have?

workers = multicoreWorkers()
workers

## [1] 4

We can use register() to tell R to use all our machine cores, in this case 4.

register(MulticoreParam())

Before we get going, let’s look at a record for one accession, ‘SRR479055’. We will use the dnanexus portal, just because it is very pretty. The same data are available at SRA at NCBI.

browseURL('http://sra.dnanexus.com/runs/SRR479055')

In the next line, I have created a set of accessions that we are going to get with fastq-dump. Again, this will be done in parallel in chunks of 4.

accessions = paste0('SRR4790',52:59)

The next line will run fastq-dump in parallel. Watch the file panel in RStudio to see things in action.

bplapply(accessions,fastqDump)

## [[1]]
## [1] 0
## 
## [[2]]
## [1] 0
## 
## [[3]]
## [1] 0
## 
## [[4]]
## [1] 0
## 
## [[5]]
## [1] 0
## 
## [[6]]
## [1] 0
## 
## [[7]]
## [1] 0
## 
## [[8]]
## [1] 0

Running Kallisto

Again, we will be using the system() R function to call the kallisto RNA-seq quantification system.

system('kallisto')

Indexing the transcriptome

I have already downloaded a relatively complete transcriptome in fasta format. Kallisto will index this file, even in gzipped format.

idxFile = 'GRC38.bld79.kallisto'
system(sprintf('kallisto index -i %s /data/kallisto/Homo_sapiens.GRCh38.rel79.cdna.all.fa.gz',
               idxFile))

Pseudoalignment and Quantification

Again, we will write a simple function to wrap up our command-line program, kallisto, to run based on the idxFile and the accessions.

runKallisto = function(accession,idxFile) {
  system(sprintf('kallisto quant --plaintext -i %s --output-dir=%s %s_1.fastq.gz %s_2.fastq.gz',
                 idxFile,accession,accession,accession))
}

And, now run it in parallel over the accessions.

register(MulticoreParam(2))
bplapply(accessions,runKallisto,idxFile)

## [[1]]
## [1] 0
## 
## [[2]]
## [1] 0
## 
## [[3]]
## [1] 0
## 
## [[4]]
## [1] 0
## 
## [[5]]
## [1] 0
## 
## [[6]]
## [1] 0
## 
## [[7]]
## [1] 0
## 
## [[8]]
## [1] 0

Note that as the code runs, new folders are being created that contain the results.

Martin Morgan has written some short code to read in kallisto output. In the next couple of lines, we load that code so we can use it to read in our results.

library(RCurl)
source(textConnection(getURL('https://gist.githubusercontent.com/mtmorgan/eaf456ad5b45c431c494/raw/2700e5002a0145ab00bf32bdb07edd2052c4c842/readKallisto.R')))

We need to find the output files for each sample. We do that by searching for any files named abundance.txt since that is the default filename created by kallisto. After finding all those files, we can read them into R as a SummarizedExperiment. Note that you will be creating this type of data structure in the differential expression lab.

fileList = list.files('.',pattern='abundance.txt',recursive=TRUE,full.names = TRUE)
res = readKallisto(fileList, what=c("tpm", "est_counts", "eff_length"), as="SummarizedExperiment")
res

## class: SummarizedExperiment 
## dim: 173259 8 
## exptData(0):
## assays(3): tpm est_counts eff_length
## rownames(173259): ENST00000415118 ENST00000448914 ...
##   ENST00000630922 ENST00000630347
## rowRanges metadata column names(1): length
## colnames(8): SRR479052 SRR479053 ... SRR479058 SRR479059
## colData names(6): n_targets n_bootstraps ... start_time call

Pseudoalignment and Quantification of RNA-seq Reads

Sean Davis

June 25, 2015

Introduction

Getting FASTQ Files

Running Kallisto

Indexing the transcriptome

Pseudoalignment and Quantification