Short Reads Processing

RNA-seq Differential Expression Analysis in Four Steps

Quality checking and data cleaning
Aligning RNA-seq reads to reference genome
Count reads in gene models
Differential gene expression study

The following four source files will do the four analysis repectively.

### quality checking
source("profiling/1.RNA-seq/1.A.1_run_qc.R")
### set up alignment
source("profiling/1.RNA-seq/1.A.2_run_alignment.R")
### prepare genomic features and count reads in genes
source("profiling/1.RNA-seq/1.A.3_run_readcount.R")
### DE analysis
source("profiling/1.RNA-seq/1.A.4_run_DEseq2.R")

FASTQ file

The sequencing reads are delivered to the user typically as a FASTQ file (with the extension of “.fastq” or “.fq”). These FASTQ files contain four lines for each read:

line 1: sequence id
line 2: nucleotide sequence
line 3: a “+” sign separator, optionally with the read identifier repeated
line 4: a corresponding ASCII string of quality characters

fq <- read.delim("../data/subset.fastq", header=FALSE)
fq

##                                                                                                                          V1
## 1                                                   @SRR1170744.sra.10000001 ILLUMINA-BDFE38_17:8:30:12659:17592 length=120
## 2  GGTGTCTGATGTTTCGAGTGGTAGAAGGCTGAACGAGAGCGAATGTAGCGAAAGAACGAAGAAAGCTAAATATTATAGTTGTAGGTATAGTTCTGCTGTTAACGAGAGACAGAATTGGTA
## 3                                                   +SRR1170744.sra.10000001 ILLUMINA-BDFE38_17:8:30:12659:17592 length=120
## 4  ED=B?CGFEGDFGGBECE8EGDGG<GEEBGGBGGB>@EBGD@DGGDDGDG??EE?DDGEBGEFDDEFCFCDEEEEHEHED<44=::????BBEEGEEGC>B@BDBDBD>;(>;69@D@@#
## 5                                                   @SRR1170744.sra.10000002 ILLUMINA-BDFE38_17:8:30:12702:17594 length=120
## 6  CCCGGTGAGATGAATCTTCTGCAGAACACTTGGTGTGTAATAATAATGAAACCATAAAACAGAGGCCATCAATGTCAAAAGGCCAGGAATCATCGCGAATGATTGAAGGCTGATGATGAA
## 7                                                   +SRR1170744.sra.10000002 ILLUMINA-BDFE38_17:8:30:12702:17594 length=120
## 8  IIIGIHIIIIIGIIIIIIGGIGHGIHIIIIIGIDIFGGEGGIIIIIIIIIIIIIIHIIIIIIIIIIIHIIIIIHHHIIHDIHFIHIIIFHIIHIIGIEHIIEHHHBCGGBFGCDE@BBBB
## 9                                                   @SRR1170744.sra.10000003 ILLUMINA-BDFE38_17:8:30:12748:17589 length=120
## 10 CTCTCCTCTTCTTGCAACTTCTTTTTTTTTTTGTTCTTGTTTTTTTCTCTTTTCAGGCGAATGATGAGGTAGAAATAGGATGGGTTTTTAGCTACTAATTACATGATTTGATGGCTTTTG

Quality checking and data cleaning

The following chunk of code is the same as profiling/1.RNA-seq/1.A.1_run_qc.R. Basically, it loops through all the PE fastq files specified by the largedata/sample.txt. It will generate a slurm script to call FASTX-Toolkit to conduct the quality checking and quality filtering of the fastq files.

### input
fastqfile = "largedata/sample.txt"
### output scripts
shfile = "largedata/step1_qc.sh"
slurmfile = "largedata/slurm_step1_qc.sh"
### int passes to fastq_quality_filer Minimum quality score to keep
q = 25
### Minimum percent of bases that must have [-q] quality
p = 50

######################################################################
source("lib/PE_qc.R")
PE_qc(fqfile = fastqfile, shfile = shfile, q = q, p = p)

source("lib/setUpslurm.R")
setUpslurm(slurmsh=slurmfile, codesh=paste("sh", shfile), wd=NULL, jobid="qcjob")

You can submit the slurm job as following: sbatch -p serial --ntasks 1 largedata/slurm_step1_qc.sh

The results can be checked in the largedata/leaf/ and largedata/root/ folders. Here are two examples of the quality checking plots.

alt alt

Aligning RNA-seq reads to reference genome

GSNAP was used for alignment with the following parameters changed.

gsnap -D largedata/OS_indica -d ASM465v1.25_gsnap -m 10 -i 2 -N 1 -w 10000 -A sam -t 8
-n 3 --quality-protocol=sanger --nofails fq1 fq2 --split-output OUTPUT

-m miss match
-i indel penalty
-N: look for novel splicing, 1=yes
-w: definition of local novel splicing event
-A: output sam format
-t: number of CPU to use
-n: max number of paths to print

Count reads in gene models

###################################################
### prepare genomic features
# More Robust: Store Annotations in TranscriptDb
###################################################
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicFeatures")
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicAlignments")
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")

library(GenomicFeatures)
library(GenomicAlignments)

txdb <- makeTranscriptDbFromGFF(file="largedata/OS_204_v7/Osativa_204_v7.0.gene_exons",
                                format="gff3",
                                dataSource="phytozome",
                                species="oryza")

saveDb(txdb, file="largedata/Osativa_204_v7.0.sqlite")

Compute raw read count and conduct RPKM normalization

It is important to keep in mind the nature of those “features” to which reads were mapped for counting. If, for example, reads are mapped to a set of references that include different splice variants of the same gene, those splice variants will each be analyzed separately unless treated as a unit. The nature of the features encompassed by read count data depends on what the mapping reference is (e.g. a genome or transcriptome assembly), and what type of aligner was used. Some pieces of analysis software are designed to handle isoform differences (Leng et al. 2013; Trapnell et al. 2013), and others analyze generic features of the transcriptome, so please consider this when selecting both read alignment and differential expression software.

source("profiling/1.RNA-seq/1.A.3_run_readcount.R")

QC checking of RPKM

QC check of the sample reproducibility by computing a correlating matrix and plotting it as a tree.

#Note: the plotMDS function from edgeR is a more robust method for this task.
library(ape)
rpkm <- read.csv("../largedata/rpkm.csv")
d <- cor(rpkm, method="spearman")
hc <- hclust(dist(1-d))
plot.phylo(as.phylo(hc), type="p", edge.col=4, edge.width=3, show.node.label=TRUE, no.margin=TRUE)

Differential gene expression study

Important! Raw count data are expected here for DESeq2!

design

A formula which specifies the design of the experiment, taking the form formula(~ x + y + z). By default, the functions in this package will use the last variable in the formula (e.g. z) for presenting results (fold changes, etc.) and plotting.

source("profiling/1.RNA-seq/1.A.4_run_DEseq2.R")