RNA-seq data preparation

Download RNA-seq data

All the RNA-seq data used in this class were downloaded from the NCBI Sequence Read Archive (SRA) using wget. Note: the RNA-seq data and intermediate outputs are stored in the largedata folder.

cd largedata/leaf/
### Oryza barthii, leaf RNA-seq, PE, 121 bp
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR117/SRR1170742/SRR1170742.sra
### extract the PE reads from sra format
fastq-dump --split-spot --split-3 -A SRR1170742.sra

### Oryza barthii, root RNA-seq, PE, 121bp
cd largedata/root/
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR117/SRR1170744/SRR1170744.sra
fastq-dump --split-spot --split-3 -A SRR1170744.sra

cd largedata/glumipatula/
### Oryza glumipatula, leaf RNA-seq, PE, 101bp
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR117/SRR1174772/SRR1174772.sra
### Oryza glumipatula, panicle RNA-seq, PE, 101bp
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR117/SRR1174773/SRR1174773.sra
### Oryza glumipatula, root RNA-seq, PE, 101bp
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR117/SRR1174777/SRR1174777.sra

### dump the SRA into fastq
fastq-dump --split-spot --split-3 -A SRR1174772.sra
fastq-dump --split-spot --split-3 -A SRR1174777.sra

Data for homework `Oryza glumipatula`

### Oryza glumipatula, root RNA-seq

wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR117/SRR1174777/SRR1174777.sra

cd largedata/
### check the total number of lines (4 lines per read, so reads = lines / 4)
wc -l SRR1170742.sra_1.fastq 
#160468896 SRR1170742.sra_1.fastq

### extract three windows of 4,000,000 lines (1,000,000 reads) from each fq file as arbitrarily assigned replicates
cat SRR1170742.sra_1.fastq | awk 'NR >= 4000001  && NR <= 8000000 { print }' > leaf/leaf.rep1_1.fastq
cat SRR1170742.sra_2.fastq | awk 'NR >= 4000001  && NR <= 8000000 { print }' > leaf/leaf.rep1_2.fastq
cat SRR1170742.sra_1.fastq | awk 'NR >= 14000001  && NR <= 18000000 { print }' > leaf/leaf.rep2_1.fastq
cat SRR1170742.sra_2.fastq | awk 'NR >= 14000001  && NR <= 18000000 { print }' > leaf/leaf.rep2_2.fastq
cat SRR1170742.sra_1.fastq | awk 'NR >= 24000001  && NR <= 28000000 { print }' > leaf/leaf.rep3_1.fastq
cat SRR1170742.sra_2.fastq | awk 'NR >= 24000001  && NR <= 28000000 { print }' > leaf/leaf.rep3_2.fastq

wc -l glumipatula/SRR1174777.sra_1.fastq 
#665118764 SRR1174777.sra_1.fastq

cat glumipatula/SRR1174777.sra_1.fastq | awk 'NR >= 4000001  && NR <= 8000000 { print }' > root/root.rep1_1.fastq
cat glumipatula/SRR1174777.sra_2.fastq | awk 'NR >= 4000001  && NR <= 8000000 { print }' > root/root.rep1_2.fastq
cat glumipatula/SRR1174777.sra_1.fastq | awk 'NR >= 14000001  && NR <= 18000000 { print }' > root/root.rep2_1.fastq
cat glumipatula/SRR1174777.sra_2.fastq | awk 'NR >= 14000001  && NR <= 18000000 { print }' > root/root.rep2_2.fastq
cat glumipatula/SRR1174777.sra_1.fastq | awk 'NR >= 24000001  && NR <= 28000000 { print }' > root/root.rep3_1.fastq
cat glumipatula/SRR1174777.sra_2.fastq | awk 'NR >= 24000001  && NR <= 28000000 { print }' > root/root.rep3_2.fastq
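
Because each FASTQ record spans four lines, the line counts above convert to read counts by dividing by four, and each awk window only stays in frame if it starts on a line numbered 4k+1. The following is a minimal sketch of that arithmetic and of a sanity check on a window, run on a tiny made-up file (`toy.fastq` and its contents are hypothetical, not part of the class data):

```shell
# Toy FASTQ with 3 reads (12 lines); file name and contents are illustrative only.
tmpdir=$(mktemp -d)
for i in 1 2 3; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > "$tmpdir/toy.fastq"

# Number of reads = number of lines / 4.
n_lines=$(wc -l < "$tmpdir/toy.fastq")
n_reads=$(( n_lines / 4 ))
echo "reads: $n_reads"

# Same windowing idea as above: lines 5-8 are exactly the second read.
awk 'NR >= 5 && NR <= 8' "$tmpdir/toy.fastq" > "$tmpdir/toy.sub.fastq"
first_header=$(head -n 1 "$tmpdir/toy.sub.fastq")
sub_lines=$(wc -l < "$tmpdir/toy.sub.fastq")
echo "subset starts with: $first_header ($sub_lines lines)"

# Read count for the leaf file, from the line count reported above.
leaf_reads=$(( 160468896 / 4 ))
echo "leaf reads: $leaf_reads"
```

The same two checks (a line count divisible by four and a first line beginning with `@`) can be applied to each `rep*_1.fastq` and `rep*_2.fastq` file before alignment.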


module load gmap/2014-05-15
gsnap -D largedata/OS_indica -d ASM465v1.25_gsnap -m 10 -i 2 -N 1 -w 10000 -A sam -t 8 -n 3 --quality-protocol=sanger --nofails largedata/root/root.rep1_1.fastq largedata/root/root.rep1_2.fastq --split-output largedata/root/root.rep1
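
The command above aligns only root replicate 1. One possible way to cover all six samples is a loop over tissue and replicate; the sketch below only prints the commands (a dry run), reuses the options shown above unchanged, and assumes the file layout `largedata/<tissue>/<tissue>.<rep>_[12].fastq` produced earlier. The function name `print_gsnap_cmds` is made up for illustration.

```shell
# Dry run: print one gsnap command per tissue/replicate instead of executing it.
# Remove the leading `echo` to run the alignments for real.
print_gsnap_cmds() {
  db_dir=largedata/OS_indica
  db_name=ASM465v1.25_gsnap
  for tissue in leaf root; do
    for rep in rep1 rep2 rep3; do
      prefix=largedata/$tissue/$tissue.$rep
      echo gsnap -D "$db_dir" -d "$db_name" -m 10 -i 2 -N 1 -w 10000 -A sam -t 8 -n 3 \
        --quality-protocol=sanger --nofails \
        "${prefix}_1.fastq" "${prefix}_2.fastq" --split-output "$prefix"
    done
  done
}

print_gsnap_cmds
```

Printing first makes it easy to inspect the paths before committing cluster time to six alignments.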

Create a txt file

Open vi and list the relative paths and names of the fastq files, together with the replicate and treatment (here, tissue) of each sample in your experimental design.

### copy the following text into your sample.txt file.
fq1 fq2 rep tissue
leaf/leaf.rep1_1.fastq leaf/leaf.rep1_2.fastq rep1 leaf
leaf/leaf.rep2_1.fastq leaf/leaf.rep2_2.fastq rep2 leaf
leaf/leaf.rep3_1.fastq leaf/leaf.rep3_2.fastq rep3 leaf
root/root.rep1_1.fastq root/root.rep1_2.fastq rep1 root
root/root.rep2_1.fastq root/root.rep2_2.fastq rep2 root
root/root.rep3_1.fastq root/root.rep3_2.fastq rep3 root
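
As an alternative to typing the table in vi, the same file could be written with a here-document and then checked. The sketch below is one way to do this, run in a scratch directory so nothing in largedata is touched; the variable names (`workdir`, `bad_rows`, `missing`) are illustrative.

```shell
# Work in a scratch directory; in practice you would run this inside largedata/.
workdir=$(mktemp -d)
cd "$workdir"

# Write the design table (same rows as above) with a here-document.
cat > sample.txt <<'EOF'
fq1 fq2 rep tissue
leaf/leaf.rep1_1.fastq leaf/leaf.rep1_2.fastq rep1 leaf
leaf/leaf.rep2_1.fastq leaf/leaf.rep2_2.fastq rep2 leaf
leaf/leaf.rep3_1.fastq leaf/leaf.rep3_2.fastq rep3 leaf
root/root.rep1_1.fastq root/root.rep1_2.fastq rep1 root
root/root.rep2_1.fastq root/root.rep2_2.fastq rep2 root
root/root.rep3_1.fastq root/root.rep3_2.fastq rep3 root
EOF

# Every row, header included, should have exactly 4 whitespace-separated fields.
bad_rows=$(awk 'NF != 4' sample.txt | wc -l)
echo "rows with wrong field count: $bad_rows"

# Count listed FASTQ files that do not exist relative to the current directory.
missing=$(awk 'NR > 1 { print $1; print $2 }' sample.txt | while read -r f; do
  [ -e "$f" ] || echo "$f"
done | wc -l)
echo "missing fastq files: $missing"
```

In the scratch directory all twelve files are reported missing; when run from the folder that holds `leaf/` and `root/`, the count should drop to zero if the paths in sample.txt are correct.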

Download reference genome and annotation data

Two reference genomes for rice are available; Oryza sativa indica is one of them. Its genome assembly and gene annotation can be downloaded from EnsemblPlants.

mkdir OS_indica
cd OS_indica
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-25/fasta/oryza_indica/dna/Oryza_indica.ASM465v1.25.dna.chromosome.*.fa.gz
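
FTP downloads can be truncated, so it may be worth checking the compressed files before building an alignment database from them. The sketch below shows two quick checks on a made-up stand-in file (`toy.chromosome.1.fa.gz` is hypothetical); on the real data the same commands would be pointed at the downloaded `*.fa.gz` files.

```shell
# Toy stand-in for a downloaded chromosome file; name and sequence are made up.
tmpdir=$(mktemp -d)
printf '>1 toy chromosome\nACGTACGT\nACGT\n' | gzip > "$tmpdir/toy.chromosome.1.fa.gz"

# gzip -t exits non-zero if the archive is truncated or corrupt.
gzip -t "$tmpdir/toy.chromosome.1.fa.gz" && echo "gzip OK"

# Count FASTA entries (header lines start with '>') without writing an uncompressed copy.
n_entries=$(gzip -dc "$tmpdir"/toy.chromosome.*.fa.gz | grep -c '^>')
echo "FASTA entries: $n_entries"
```

With one chromosome per file, the entry count should match the number of files downloaded.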

Setup alignment database

GSNAP (Genomic Short-read Nucleotide Alignment Program) is a fast, splice-aware aligner. You can download and install it from the GMAP/GSNAP site (http://research-pub.gene.com/gmap/).

### load the gmap module if you are working on farm
module load gmap/2014-05-15
### http://research-pub.gene.com/gmap/src/README
### Setting up to build a GMAP/GSNAP database (one chromosome per FASTA entry)
### Note: be careful about your dir specified by -D
gmap_build -D largedata/OS_indica/ -d ASM465v1.25_gsnap -g Oryza_indica.ASM465v1.25.dna.chromosome*fa.gz > gmapbuild.log

Jinliang Yang, Ross-Ibarra lab

February 17, 2015
