RNA-seq data preparation

Download RNA-seq data

All the RNA-seq data used in this class were downloaded from NCBI Sequence Read Archive (SRA) database using wget. Note: the RNA-seq data and intermediate outputs are put in the folder of largedata.

cd glumipatula/
### Oryza glumipatula, leaf RNA-seq, PE, 101bp
wget get ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR117/SRR1174772/SRR1174772.sra
### Oryza glumipatula, panicle RNA-seq, PE, 101bp
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR117/SRR1174773/SRR1174773.sra
### Oryza glumipatula, root RNA-seq, PE, 101bp
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR117/SRR1174777/SRR1174777.sra

### dump the SRA into fastq
fastq-dump --split-spot --split-3 -A SRR1174772.sra
fastq-dump --split-spot --split-3 -A SRR1174777.sra

Simulate an experiment with three replications

The aim of this experiment is to compare the differentially expressed genes in two different tissue type: leaf vs. root. One million reads were extracted for each rep.

cd largedata/
### checking the total number of reads
wc -l glumipatula/SRR1174772.sra_1.fastq 
#660468896 SRR1170742.sra_1.fastq

### split the fq files into three files as arbitrarily assigned replications
cat glumipatula/SRR1174772.sra_1.fastq | awk 'NR >= 4000001  && NR <= 8000000 { print }' > leaf/leaf.rep1_1.fastq
cat glumipatula/SRR1174772.sra_2.fastq | awk 'NR >= 4000001  && NR <= 8000000 { print }' > leaf/leaf.rep1_2.fastq
cat glumipatula/SRR1174772.sra_1.fastq | awk 'NR >= 14000001  && NR <= 18000000 { print }' > leaf/leaf.rep2_1.fastq
cat glumipatula/SRR1174772.sra_2.fastq | awk 'NR >= 14000001  && NR <= 18000000 { print }' > leaf/leaf.rep2_2.fastq
cat glumipatula/SRR1174772.sra_1.fastq | awk 'NR >= 24000001  && NR <= 28000000 { print }' > leaf/leaf.rep3_1.fastq
cat glumipatula/SRR1174772.sra_2.fastq | awk 'NR >= 24000001  && NR <= 28000000 { print }' > leaf/leaf.rep3_2.fastq

wc -l glumipatula/SRR1174777.sra_1.fastq 
#665118764 SRR1174777.sra_1.fastq

cat glumipatula/SRR1174777.sra_1.fastq | awk 'NR >= 4000001  && NR <= 8000000 { print }' > root/root.rep1_1.fastq
cat glumipatula/SRR1174777.sra_2.fastq | awk 'NR >= 4000001  && NR <= 8000000 { print }' > root/root.rep1_2.fastq
cat glumipatula/SRR1174777.sra_1.fastq | awk 'NR >= 14000001  && NR <= 18000000 { print }' > root/root.rep2_1.fastq
cat glumipatula/SRR1174777.sra_2.fastq | awk 'NR >= 14000001  && NR <= 18000000 { print }' > root/root.rep2_2.fastq
cat glumipatula/SRR1174777.sra_1.fastq | awk 'NR >= 24000001  && NR <= 28000000 { print }' > root/root.rep3_1.fastq
cat glumipatula/SRR1174777.sra_2.fastq | awk 'NR >= 24000001  && NR <= 28000000 { print }' > root/root.rep3_2.fastq

Log sample info to a txt file

Open vi, specify the relative location and names of the fastq files. And, the replications and treatments of your experimental design.

### copy the following text into your sample.txt file.
fq1 fq2 rep tissue
leaf/leaf.rep1_1.fastq leaf/leaf.rep1_2.fastq rep1 leaf
leaf/leaf.rep2_1.fastq leaf/leaf.rep2_2.fastq rep2 leaf
leaf/leaf.rep3_1.fastq leaf/leaf.rep3_2.fastq rep3 leaf
root/root.rep1_1.fastq root/root.rep1_2.fastq rep1 root
root/root.rep2_1.fastq root/root.rep2_2.fastq rep2 root
root/root.rep3_1.fastq root/root.rep3_2.fastq rep3 root

Download reference genome and annotation data

Two reference genomes for rice are available. Oryza sativa indica is one of the them. The genome assembly and gene annotation can be downloaded from EnsemblPlants.

mkdir OS_indica
cd OS_indica
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-25/fasta/oryza_indica/dna/Oryza_indica.ASM465v1.25.dna.chromosome.*.fa.gz

Setup alignment database

GSNAP (Genomic Short-read Nucleotide Alignment Program) is a fast and splice-aware aligner. You can download and install it from here.

### load the module of gmap if you use farm
module load gmap/2014-05-15
### http://research-pub.gene.com/gmap/src/README
### Setting up to build a GMAP/GSNAP database (one chromosome per FASTA entry)
### Note: be careful about your dir specified by -D
gmap_build -D largedata/OS_indica/ -d ASM465v1.25_gsnap -g Oryza_indica.ASM465v1.25.dna.chromosome*fa.gz > gmapbuild.log