Lab09

RNA sequencing data analysis - 1

Processing sequencing data and obtaining counts matrix

Retrieve the raw sequencing data

You can download the data from the public raw data repository SRA (Short Read Archieve) of NCBI. However, in this tutorial, you will get it directly from this link:

https://drive.google.com/drive/folders/1VLyCxxBtP5aM3kQ-8S0y6_cQLtO3s1Vu

Throughout this tutorial, we will use conda environment we created earlier.


conda create -n bfblab2 -c bioconda igv star samtools htseq

conda activate bfblab2

Aligning reads using STAR is a two step process:

Create a genome index
Map reads to the genome

As a read aligner, we will use STAR software on the command line.

Running multiple various jobs simultaneously on the command line (terminal) can get messy easily and would be hard to keep track. So, we highly recommend you to organize your code as a bash script as follows:

Let’s go ahead and create a file using gedit, and write the commands below. Then, save it by calling it Process_RNAseq.sh

#!/bin/bash

#Lines that start with a '#' are comments as in python syntax

#Create a folder to put STAR index files into.
mkdir chr1_hg38_index

#Before running any thing, just make sure that you have downloaded all the required data using the link above! To check:
ls -al ./

#Creating an genome index
STAR --runThreadN 2 \
--runMode genomeGenerate \
--genomeDir chr1_hg38_index \
--genomeFastaFiles Hg38_chr1.fasta \
--sjdbGTFfile Hg38_chr1.gff3 \
--sjdbOverhang 62 \
--genomeSAindexNbases 12

#Checking the output folder of the index files:
ls -al chr1_hg38_index/

#Create another folder for the alignment outputs:
mkdir starTest/

#Unzip the reads! STAR doesn't like gzipped files!
gunzip SRR1039512sub_*.fastq.gz

#Increase the process limits
ulimit -n 1000

#Aligning reads:
STAR --genomeDir chr1_hg38_index/ \
--runThreadN 6 \
--readFilesIn SRR1039512sub_1.fastq SRR1039512sub_2.fastq \
--outFileNamePrefix starTest/ \
--outSAMtype BAM SortedByCoordinate

#Check the outputs folder:
ls -al starTest/

#We need to index the alignment bam file
samtools index starTest/Aligned.sortedByCoord.out.bam

To run the bash script, go back to the terminal and run:

bash Process_RNAseq.sh

Visualization using IGV

You need to start it by executing this command on the terminal

igv

This will open a new GUI (graphical user interphase) for IGV software. We will load a reference genome to start. Next, load the sorted and indexed bam alignment file.

Upload the genome file and the alignment BAM file

Go to the NC_000001.11:148,100,770-148,153,561 to visualize the read alignments on the NBPF gene.

Counting the reads per gene

Same as we did above for aligning reads, we will create another bash script to count the reads:

#!/bin/bash
htseq-count -i gene_name \
-f bam starTest/Aligned.sortedByCoord.out.bam \
RNAseqTutorial_1_data/Hg38_chr1_exons.gtf > SRR1039512.counts

#Check the output:
less SRR1039512.counts