Contents

1 Introduction

The data we will play with are from this study:

Rationale: Asthma is a chronic inflammatory airway disease. The most common medications used for its treatment are β2-agonists and glucocorticosteroids, and one of the primary tissues that these drugs target in the treatment of asthma is the airway smooth muscle. We used RNA-Seq to characterize the human airway smooth muscle (HASM) transcriptome at baseline and under three asthma treatment conditions.

Methods: The Illumina TruSeq assay was used to prepare 75bp paired-end libraries for HASM cells from four white male donors under four treatment conditions: 1) no treatment; 2) treatment with a β2-agonist (i.e. Albuterol, 1μM for 18h); 3) treatment with a glucocorticosteroid (i.e. Dexamethasone (Dex), 1μM for 18h); 4) simultaneous treatment with a β2-agonist and glucocorticoid, and the libraries were sequenced with an Illumina Hi-Seq 2000 instrument.

2 Preparing data

2.1 Install the SRA Toolkit

To get the data out of the Sequence Read Archive (SRA), NCBI provides the SRA Toolkit. I install that here to perform the simultaneous download and fastq conversion.

download.file('http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.6.3/sratoolkit.2.6.3-ubuntu64.tar.gz',
              'sratoolkit.2.6.3-ubuntu64.tar.gz')
untar('sratoolkit.2.6.3-ubuntu64.tar.gz',compressed=TRUE)

We will put all the FASTQ files in a single directory. Paired-end reads will be saved with filenames that match the original SRA accessions.

dir.create('fastq')
Sys.setenv('PATH'=paste(Sys.getenv("PATH"),'sratoolkit.2.6.3-ubuntu64/bin',sep=":"))

2.2 Find SRA accessions using SRAdb package

In order to find the associated SRA Run accessions for the SRA study, we will use the SRAdb package.

# install SRAdb if not already present
if(!require('SRAdb')) biocLite('SRAdb')
sqlf = 'SRAmetadb.sqlite'
# download the database if it is not already present
if(!file.exists(sqlf)) {
  getSRAdbFile()
}
conn = dbConnect(SQLite(),sqlf)
srp    = 'SRP033351'
dtable = sraConvert('SRP033351',out_type = c("sra", "submission", "study", "sample", "experiment", "run"),conn)
knitr::kable(dtable)
study submission sample experiment run
SRP033351 SRA114259 SRS508581 SRX384360 SRR1039523
SRP033351 SRA114259 SRS508573 SRX384352 SRR1039515
SRP033351 SRA114259 SRS508570 SRX384347 SRR1039510
SRP033351 SRA114259 SRS508572 SRX384350 SRR1039513
SRP033351 SRA114259 SRS508582 SRX384359 SRR1039522
SRP033351 SRA114259 SRS508575 SRX384353 SRR1039516
SRP033351 SRA114259 SRS508574 SRX384351 SRR1039514
SRP033351 SRA114259 SRS508577 SRX384356 SRR1039519
SRP033351 SRA114259 SRS508576 SRX384354 SRR1039517
SRP033351 SRA114259 SRS508580 SRX384358 SRR1039521
SRP033351 SRA114259 SRS508579 SRX384357 SRR1039520
SRP033351 SRA114259 SRS508571 SRX384349 SRR1039512
SRP033351 SRA114259 SRS508569 SRX384348 SRR1039511
SRP033351 SRA114259 SRS508567 SRX384346 SRR1039509
SRP033351 SRA114259 SRS508568 SRX384345 SRR1039508
SRP033351 SRA114259 SRS508578 SRX384355 SRR1039518

2.3 Get FASTQ from SRA

Now, we have the accessions of the SRA runs. These accessions contain the actual sequences, so we use the fastq-dump executable to download and dump fastq into the fastq directory.

fastqdump = function(accession,dir = getwd()) {
  system(sprintf('fastq-dump --split-files --gzip -O %s %s',dir,accession))
}
library(BiocParallel)
register(MulticoreParam())
bplapply(dtable$run,fastqdump,dir='fastq')

2.4 Upload to SevenBridges cloud

# library(BiocInstaller)
# useDevel()
# biocLite('sevenbridges')
library(sevenbridges)
a<-Auth(user="sdavis2", platform = "cgc", token = "XXXXXXXXXXX")
p = a$project('CSHL-2016')
sapply(dir('fastq',full.names = TRUE),p$upload)
p$file('*fastq.gz')
lapply(p$file('*fastq.gz'),function(j) {
  print(j$name)
  sn = strsplit(j$name,'_')[[1]][1]
  print(j$setMeta(sample_id=sn))
})