Problem 1: STAR alignment

We will give you a simple example to test high throughput sequencing alignment for RNA-seq data. Normally for paired-end sequencing data, each sample will have two separate FASTQ files, with line-by-line correspondence to the two reads from the same fragment. Read mapping could take a long time, so we have created just two FASTQ files of one RNA-seq sample with only 30,000 fragments (2 * 30,000 reads) for you to run STAR instead of the full data. The files are located on Huawei Cloud at /mnt/data/data/HW1/raw_data1. The mapping will generate one single output file.

Use STAR (Dobin et al, Bioinformatics 2012) to map the reads to the reference genome, available on Huawei Cloud at /mnt/data/data/HW1/index/star_hg38_index. Use the paired-end alignment mode and generate the output in SAM format. Please include full STAR report.How many reads are mappable and how many are uniquely mappable?

Sequencing data:
/mnt/data/data/HW1/raw_data1

module: STAR_2.7.10a
index: /mnt/data/data/HW1/index/star_hg38_index
# your bash code here

install.packages(‘rlang’) library(‘rlang’)

Problem 2: RNA-seq quality control

You are asked by a collaborator to analyze four RNA-seq libraries. She suspects that the libraries are generally of high-quality but is concerned that a sample may have been switched with her benchmates during processing. To save time, we have provided four bam file generated by STAR in /mnt/data/data/HW1/raw_data2.

Please use RSeQC (Liguo Wang et al, Bioinformatics 2012) geneBody_coverage.py and tin.py modules to determine whether any of the samples exhibit unusual quality control metrics. To expedite the process, you can use housekeeping genes as reference, which provided in /mnt/data/data/HW1/raw_data2. Overall, identify the best and worst libraries. Your answer should include screen shots and tables as necessary as if you were delivering a report to the collaborator.

data:
/mnt/data/data/HW1/raw_data2/bamFile
module:RSeQC v4.0.0
# your bash code here

Problem 3: Python programming

One output of RSeQC is called geneBodyCoverage.r which contains normalized reads mapped to each % of gene / transcript body. Suppose that we want to visualize all 4 samples together to quickly perform quality control. Write a python program to extract the values and name from each file. The same script should then draw the gene body coverage for all the samples (2 rows x 2 cols) in one figure. We provide an example with 3 x 2 samples in one figure. Please identify the worst sample by your result.

# your python code here

Problem 4: RNA-seq quantification

Transcript quantification plays an important role in the analysis of RNA-seq data. A large number of tools have been developed to quantify expression at the transcript level. RSEM (Bo Li et al, BMC Bioinformatics 2011) is a software package for estimating gene and isoform expression levels from single-end or paired-end RNA-Seq data, it can perform the alignment step with three different aligners: bowtie, bowtie2, or STAR. Salmon (Rob Patro et al, Nature Methods 2017) is an ultra-fast alignment-free methods, which also can correct for GC-bias.

Please run STAR+RSEM and Salmon on one good quality sample based on previous discussion to get FPKM and TPM. Identify the transcript and gene with the highest expression in this library from the Salmon output.

data: /mnt/data/data/HW1/raw_data2/
module: STAR_2.7.10a, RSEM v1.3.3, Salmon 1.5.1
index: /mnt/data/data/HW1/index/salmon_hg38_index,/mnt/data/data/HW1/index/rsem_hg38_index/hg38
genome: /mnt/data/data/HW1/genome/hg38/genes.gtf
# your bash code here

Problem 5: Speed Comparison

Report the relative speed of STAR+RSEM and Salmon for the analyses of the sample. Comment on your results based on the lecture material.

Problem 6:

Plot the relationship between effective length, normalized read counts, TPM, and FPKM for this sample from the RSEM and Salmon output. Comment on the relative utility of each metric when analyzing gene expression data.

# your R code here

Please knit your homework to HTML file, and rename is as ‘03_Zhangsan_Homework1.html’.

Please submit your homework1 to http://140.210.214.122 before 23:59 of Jul/24/2021.