In this little workflow, we will be using a relatively new technology, pseudoalignment and quantification to deal with RNA-seq data from eight samples. The technical steps are:
SummarizedExperiment objectTechnical skills being showcased include:
system() functionality to orchestrate workflows involving command-line programsWe can use the system() R function to call shell functions and scripts. In this case, we are going to use the fastq-dump utility to access SRA accessions. You will be using these data to do the RNA-seq differential expression lab later, though the reads for that lab have been processed using a more traditional protocol.
system('fastq-dump --help')
Based on the output of the fastq-dump help, we can write a small function that will take an SRA accession and dump out paired-end fastq files that are pre-gzipped. To keep things quick, I specify a default of 1e6 reads, but you can change to much larger if you have a lot of time on your hands. With 1e6 reads, each file will be about 90MB in size.
fastqDump = function(sraAccession,maxReads = 1000000) {
system(sprintf('fastq-dump --split-files --gzip --maxSpotId %d %s',maxReads,sraAccession))
}
We can use the fact that we have multiple cores to process data in parallel.
library(BiocParallel)
How many cores does my machine have?
workers = multicoreWorkers()
workers
## [1] 4
We can use register() to tell R to use all our machine cores, in this case 4.
register(MulticoreParam())
Before we get going, let’s look at a record for one accession, ‘SRR479055’. We will use the dnanexus portal, just because it is very pretty. The same data are available at SRA at NCBI.
browseURL('http://sra.dnanexus.com/runs/SRR479055')
In the next line, I have created a set of accessions that we are going to get with fastq-dump. Again, this will be done in parallel in chunks of 4.
accessions = paste0('SRR4790',52:59)
The next line will run fastq-dump in parallel. Watch the file panel in RStudio to see things in action.
bplapply(accessions,fastqDump)
## [[1]]
## [1] 0
##
## [[2]]
## [1] 0
##
## [[3]]
## [1] 0
##
## [[4]]
## [1] 0
##
## [[5]]
## [1] 0
##
## [[6]]
## [1] 0
##
## [[7]]
## [1] 0
##
## [[8]]
## [1] 0
Again, we will be using the system() R function to call the kallisto RNA-seq quantification system.
system('kallisto')
I have already downloaded a relatively complete transcriptome in fasta format. Kallisto will index this file, even in gzipped format.
idxFile = 'GRC38.bld79.kallisto'
system(sprintf('kallisto index -i %s /data/kallisto/Homo_sapiens.GRCh38.rel79.cdna.all.fa.gz',
idxFile))
Again, we will write a simple function to wrap up our command-line program, kallisto, to run based on the idxFile and the accessions.
runKallisto = function(accession,idxFile) {
system(sprintf('kallisto quant --plaintext -i %s --output-dir=%s %s_1.fastq.gz %s_2.fastq.gz',
idxFile,accession,accession,accession))
}
And, now run it in parallel over the accessions.
register(MulticoreParam(2))
bplapply(accessions,runKallisto,idxFile)
## [[1]]
## [1] 0
##
## [[2]]
## [1] 0
##
## [[3]]
## [1] 0
##
## [[4]]
## [1] 0
##
## [[5]]
## [1] 0
##
## [[6]]
## [1] 0
##
## [[7]]
## [1] 0
##
## [[8]]
## [1] 0
Note that as the code runs, new folders are being created that contain the results.
Martin Morgan has written some short code to read in kallisto output. In the next couple of lines, we load that code so we can use it to read in our results.
library(RCurl)
source(textConnection(getURL('https://gist.githubusercontent.com/mtmorgan/eaf456ad5b45c431c494/raw/2700e5002a0145ab00bf32bdb07edd2052c4c842/readKallisto.R')))
We need to find the output files for each sample. We do that by searching for any files named abundance.txt since that is the default filename created by kallisto. After finding all those files, we can read them into R as a SummarizedExperiment. Note that you will be creating this type of data structure in the differential expression lab.
fileList = list.files('.',pattern='abundance.txt',recursive=TRUE,full.names = TRUE)
res = readKallisto(fileList, what=c("tpm", "est_counts", "eff_length"), as="SummarizedExperiment")
res
## class: SummarizedExperiment
## dim: 173259 8
## exptData(0):
## assays(3): tpm est_counts eff_length
## rownames(173259): ENST00000415118 ENST00000448914 ...
## ENST00000630922 ENST00000630347
## rowRanges metadata column names(1): length
## colnames(8): SRR479052 SRR479053 ... SRR479058 SRR479059
## colData names(6): n_targets n_bootstraps ... start_time call