The following four source files will do the four analysis repectively.
### quality checking
source("profiling/1.RNA-seq/1.A.1_run_qc.R")
### set up alignment
source("profiling/1.RNA-seq/1.A.2_run_alignment.R")
### prepare genomic features and count reads in genes
source("profiling/1.RNA-seq/1.A.3_run_readcount.R")
### DE analysis
source("profiling/1.RNA-seq/1.A.4_run_DEseq2.R")
The sequencing reads are delivered to the user typically as a FASTQ file (with the extension of “.fastq” or “.fq”). These FASTQ files contain four lines for each read:
fq <- read.delim("../data/subset.fastq", header=FALSE)
fq
## V1
## 1 @SRR1170744.sra.10000001 ILLUMINA-BDFE38_17:8:30:12659:17592 length=120
## 2 GGTGTCTGATGTTTCGAGTGGTAGAAGGCTGAACGAGAGCGAATGTAGCGAAAGAACGAAGAAAGCTAAATATTATAGTTGTAGGTATAGTTCTGCTGTTAACGAGAGACAGAATTGGTA
## 3 +SRR1170744.sra.10000001 ILLUMINA-BDFE38_17:8:30:12659:17592 length=120
## 4 ED=B?CGFEGDFGGBECE8EGDGG<GEEBGGBGGB>@EBGD@DGGDDGDG??EE?DDGEBGEFDDEFCFCDEEEEHEHED<44=::????BBEEGEEGC>B@BDBDBD>;(>;69@D@@#
## 5 @SRR1170744.sra.10000002 ILLUMINA-BDFE38_17:8:30:12702:17594 length=120
## 6 CCCGGTGAGATGAATCTTCTGCAGAACACTTGGTGTGTAATAATAATGAAACCATAAAACAGAGGCCATCAATGTCAAAAGGCCAGGAATCATCGCGAATGATTGAAGGCTGATGATGAA
## 7 +SRR1170744.sra.10000002 ILLUMINA-BDFE38_17:8:30:12702:17594 length=120
## 8 IIIGIHIIIIIGIIIIIIGGIGHGIHIIIIIGIDIFGGEGGIIIIIIIIIIIIIIHIIIIIIIIIIIHIIIIIHHHIIHDIHFIHIIIFHIIHIIGIEHIIEHHHBCGGBFGCDE@BBBB
## 9 @SRR1170744.sra.10000003 ILLUMINA-BDFE38_17:8:30:12748:17589 length=120
## 10 CTCTCCTCTTCTTGCAACTTCTTTTTTTTTTTGTTCTTGTTTTTTTCTCTTTTCAGGCGAATGATGAGGTAGAAATAGGATGGGTTTTTAGCTACTAATTACATGATTTGATGGCTTTTG
The following chunk of code is the same as profiling/1.RNA-seq/1.A.1_run_qc.R. Basically, it loops through all the PE fastq files specified by the largedata/sample.txt. It will generate a slurm script to call FASTX-Toolkit to conduct the quality checking and quality filtering of the fastq files.
### input
fastqfile = "largedata/sample.txt"
### output scripts
shfile = "largedata/step1_qc.sh"
slurmfile = "largedata/slurm_step1_qc.sh"
### int passes to fastq_quality_filer Minimum quality score to keep
q = 25
### Minimum percent of bases that must have [-q] quality
p = 50
######################################################################
source("lib/PE_qc.R")
PE_qc(fqfile = fastqfile, shfile = shfile, q = q, p = p)
source("lib/setUpslurm.R")
setUpslurm(slurmsh=slurmfile, codesh=paste("sh", shfile), wd=NULL, jobid="qcjob")
You can submit the slurm job as following: sbatch -p serial --ntasks 1 largedata/slurm_step1_qc.sh
The results can be checked in the largedata/leaf/ and largedata/root/ folders. Here are two examples of the quality checking plots.
GSNAP was used for alignment with the following parameters changed.
gsnap -D largedata/OS_indica -d ASM465v1.25_gsnap -m 10 -i 2 -N 1 -w 10000 -A sam -t 8-n 3 --quality-protocol=sanger --nofails fq1 fq2 --split-output OUTPUT
###################################################
### prepare genomic features
# More Robust: Store Annotations in TranscriptDb
###################################################
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicFeatures")
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicAlignments")
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")
library(GenomicFeatures)
library(GenomicAlignments)
txdb <- makeTranscriptDbFromGFF(file="largedata/OS_204_v7/Osativa_204_v7.0.gene_exons",
format="gff3",
dataSource="phytozome",
species="oryza")
saveDb(txdb, file="largedata/Osativa_204_v7.0.sqlite")
It is important to keep in mind the nature of those “features” to which reads were mapped for counting. If, for example, reads are mapped to a set of references that include different splice variants of the same gene, those splice variants will each be analyzed separately unless treated as a unit. The nature of the features encompassed by read count data depends on what the mapping reference is (e.g. a genome or transcriptome assembly), and what type of aligner was used. Some pieces of analysis software are designed to handle isoform differences (Leng et al. 2013; Trapnell et al. 2013), and others analyze generic features of the transcriptome, so please consider this when selecting both read alignment and differential expression software.
source("profiling/1.RNA-seq/1.A.3_run_readcount.R")
QC check of the sample reproducibility by computing a correlating matrix and plotting it as a tree.
#Note: the plotMDS function from edgeR is a more robust method for this task.
library(ape)
rpkm <- read.csv("../largedata/rpkm.csv")
d <- cor(rpkm, method="spearman")
hc <- hclust(dist(1-d))
plot.phylo(as.phylo(hc), type="p", edge.col=4, edge.width=3, show.node.label=TRUE, no.margin=TRUE)
Important! Raw count data are expected here for DESeq2!
A formula which specifies the design of the experiment, taking the form formula(~ x + y + z). By default, the functions in this package will use the last variable in the formula (e.g. z) for presenting results (fold changes, etc.) and plotting.
source("profiling/1.RNA-seq/1.A.4_run_DEseq2.R")