The data we will play with are from this study:
Rationale: Asthma is a chronic inflammatory airway disease. The most common medications used for its treatment are β2-agonists and glucocorticosteroids, and one of the primary tissues that these drugs target in the treatment of asthma is the airway smooth muscle. We used RNA-Seq to characterize the human airway smooth muscle (HASM) transcriptome at baseline and under three asthma treatment conditions.
Methods: The Illumina TruSeq assay was used to prepare 75bp paired-end libraries for HASM cells from four white male donors under four treatment conditions: 1) no treatment; 2) treatment with a β2-agonist (i.e. Albuterol, 1μM for 18h); 3) treatment with a glucocorticosteroid (i.e. Dexamethasone (Dex), 1μM for 18h); 4) simultaneous treatment with a β2-agonist and glucocorticoid, and the libraries were sequenced with an Illumina Hi-Seq 2000 instrument.
To get the data out of the Sequence Read Archive (SRA), NCBI provides the SRA Toolkit. I install that here to perform the simultaneous download and fastq conversion.
download.file('http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.6.3/sratoolkit.2.6.3-ubuntu64.tar.gz',
'sratoolkit.2.6.3-ubuntu64.tar.gz')
untar('sratoolkit.2.6.3-ubuntu64.tar.gz',compressed=TRUE)
We will put all the FASTQ files in a single directory. Paired-end reads will be saved with filenames that match the original SRA accessions.
dir.create('fastq')
Sys.setenv('PATH'=paste(Sys.getenv("PATH"),'sratoolkit.2.6.3-ubuntu64/bin',sep=":"))
SRAdb packageIn order to find the associated SRA Run accessions for the SRA study, we will use the SRAdb package.
# install SRAdb if not already present
if(!require('SRAdb')) biocLite('SRAdb')
sqlf = 'SRAmetadb.sqlite'
# download the database if it is not already present
if(!file.exists(sqlf)) {
getSRAdbFile()
}
conn = dbConnect(SQLite(),sqlf)
srp = 'SRP033351'
dtable = sraConvert('SRP033351',out_type = c("sra", "submission", "study", "sample", "experiment", "run"),conn)
knitr::kable(dtable)
| study | submission | sample | experiment | run |
|---|---|---|---|---|
| SRP033351 | SRA114259 | SRS508581 | SRX384360 | SRR1039523 |
| SRP033351 | SRA114259 | SRS508573 | SRX384352 | SRR1039515 |
| SRP033351 | SRA114259 | SRS508570 | SRX384347 | SRR1039510 |
| SRP033351 | SRA114259 | SRS508572 | SRX384350 | SRR1039513 |
| SRP033351 | SRA114259 | SRS508582 | SRX384359 | SRR1039522 |
| SRP033351 | SRA114259 | SRS508575 | SRX384353 | SRR1039516 |
| SRP033351 | SRA114259 | SRS508574 | SRX384351 | SRR1039514 |
| SRP033351 | SRA114259 | SRS508577 | SRX384356 | SRR1039519 |
| SRP033351 | SRA114259 | SRS508576 | SRX384354 | SRR1039517 |
| SRP033351 | SRA114259 | SRS508580 | SRX384358 | SRR1039521 |
| SRP033351 | SRA114259 | SRS508579 | SRX384357 | SRR1039520 |
| SRP033351 | SRA114259 | SRS508571 | SRX384349 | SRR1039512 |
| SRP033351 | SRA114259 | SRS508569 | SRX384348 | SRR1039511 |
| SRP033351 | SRA114259 | SRS508567 | SRX384346 | SRR1039509 |
| SRP033351 | SRA114259 | SRS508568 | SRX384345 | SRR1039508 |
| SRP033351 | SRA114259 | SRS508578 | SRX384355 | SRR1039518 |
Now, we have the accessions of the SRA runs. These accessions contain the actual sequences, so we use the fastq-dump executable to download and dump fastq into the fastq directory.
fastqdump = function(accession,dir = getwd()) {
system(sprintf('fastq-dump --split-files --gzip -O %s %s',dir,accession))
}
library(BiocParallel)
register(MulticoreParam())
bplapply(dtable$run,fastqdump,dir='fastq')
# library(BiocInstaller)
# useDevel()
# biocLite('sevenbridges')
library(sevenbridges)
a<-Auth(user="sdavis2", platform = "cgc", token = "XXXXXXXXXXX")
p = a$project('CSHL-2016')
sapply(dir('fastq',full.names = TRUE),p$upload)
p$file('*fastq.gz')
lapply(p$file('*fastq.gz'),function(j) {
print(j$name)
sn = strsplit(j$name,'_')[[1]][1]
print(j$setMeta(sample_id=sn))
})