Problem 1: STAR alignment

We will give you a simple example to test high throughput sequencing alignment for RNA-seq data. Normally for paired-end sequencing data, each sample will have two separate FASTQ files, with line-by-line correspondence to the two reads from the same fragment. Read mapping could take a long time, so we have created just two FASTQ files of one RNA-seq sample with only 30,000 fragments (2 * 30,000 reads) for you to run STAR instead of the full data. The files are located on Huawei Cloud at /mnt/data/data/HW1/raw_data1. The mapping will generate one single output file.

Use STAR (Dobin et al, Bioinformatics 2012) to map the reads to the reference genome, available on Huawei Cloud at /mnt/data/data/HW1/index/star_hg38_index. Use the paired-end alignment mode and generate the output in SAM format. Please include full STAR report.How many reads are mappable and how many are uniquely mappable?

Sequencing data:
/mnt/data/data/HW1/raw_data1

module: STAR_2.7.10a
index: /mnt/data/data/HW1/index/star_hg38_index
#  working path: ~/

STAR --runThreadN 8 --genomeDir /mnt/data/data/HW1/index/star_hg38_index/ --readFilesIn /mnt/data/data/HW1/raw_data1/subA_l.fastq /mnt/data/data/HW1/raw_data1/subA_r.fastq  --outFileNamePrefix ./liangyuwei_w1_1

cat liangyuwei_w1_1Log.final.out
Results:

                          Number of input reads |       29988
                      Average input read length |       96
                                    UNIQUE READS:
                   Uniquely mapped reads number |       19293
                        Uniquely mapped reads % |       64.34%
                          Average mapped length |       95.61
                       Number of splices: Total |       2694
            Number of splices: Annotated (sjdb) |       2656
                       Number of splices: GT/AG |       2667
                       Number of splices: GC/AG |       19
                       Number of splices: AT/AC |       2
               Number of splices: Non-canonical |       6
                      Mismatch rate per base, % |       0.15%
                         Deletion rate per base |       0.01%
                        Deletion average length |       1.50
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.28
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       8900
             % of reads mapped to multiple loci |       29.68%
        Number of reads mapped to too many loci |       79
             % of reads mapped to too many loci |       0.26%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       0
       % of reads unmapped: too many mismatches |       0.00%
            Number of reads unmapped: too short |       1675
                 % of reads unmapped: too short |       5.59%
                Number of reads unmapped: other |       41
                     % of reads unmapped: other |       0.14%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

Problem 2: RNA-seq quality control

You are asked by a collaborator to analyze four RNA-seq libraries. She suspects that the libraries are generally of high-quality but is concerned that a sample may have been switched with her benchmates during processing. To save time, we have provided four bam file generated by STAR in /mnt/data/data/HW1/raw_data2.

Please use RSeQC (Liguo Wang et al, Bioinformatics 2012) geneBody_coverage.py and tin.py modules to determine whether any of the samples exhibit unusual quality control metrics. To expedite the process, you can use housekeeping genes as reference, which provided in /mnt/data/data/HW1/raw_data2. Overall, identify the best and worst libraries. Your answer should include screen shots and tables as necessary as if you were delivering a report to the collaborator.

data:
/mnt/data/data/HW1/raw_data2/bamFile
module:RSeQC v4.0.0
# working path: ~/

geneBody_coverage.py -r /mnt/data/data/HW1/raw_data2/hg38.HouseKeepingGenes.nochr.bed -i /mnt/data/data/HW1/raw_data2/bamFile  -o Res_

tin.py -r /mnt/data/data/HW1/raw_data2/hg38.HouseKeepingGenes.nochr.bed -i /mnt/data/data/HW1/raw_data2/bamFile 
Bam_file TIN.mean. TIN.median. TIN.stdev.
res_WAligned.sortedByCoord.out.bam 45.74958 45.48953 21.33583
res_XAligned.sortedByCoord.out.bam 53.78790 55.81712 20.62771
res_YAligned.sortedByCoord.out.bam 56.20032 58.76576 19.80363
res_ZAligned.sortedByCoord.out.bam 24.55179 21.92499 14.20632

In this case, we can identify that Z is the worst sample, W is the best sample.

Problem 3: Python programming

One output of RSeQC is called geneBodyCoverage.r which contains normalized reads mapped to each % of gene / transcript body. Suppose that we want to visualize all 4 samples together to quickly perform quality control. Write a python program to extract the values and name from each file. The same script should then draw the gene body coverage for all the samples (2 rows x 2 cols) in one figure. We provide an example with 3 x 2 samples in one figure. Please identify the worst sample by your result.

f=open('Res1_.geneBodyCoverage.r')

y=[]
name=['Z','W','X','Y']
for line in f:
    if line.startswith('res_'):
        s = line[36:-2]
        yy = s.split(',')
        yyy=[]
        for e in yy:
            yyy.append(float(e))
        y.append(yyy)
    else:
        break

from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0,100,100)

for i in range(len(y)):
    plt.subplot((221+i))
    plt.plot(x,y[i],'g-')
    plt.title(name[i])
    plt.xlabel("Gene body percentile (5'->3')")
    plt.ylabel('Coverage')
    plt.tight_layout()

with PdfPages('Q3.pdf') as pp:
    pp.savefig()

In this case, we can identify that Z is the worst sample.

Problem 4: RNA-seq quantification

Transcript quantification plays an important role in the analysis of RNA-seq data. A large number of tools have been developed to quantify expression at the transcript level. RSEM (Bo Li et al, BMC Bioinformatics 2011) is a software package for estimating gene and isoform expression levels from single-end or paired-end RNA-Seq data, it can perform the alignment step with three different aligners: bowtie, bowtie2, or STAR. Salmon (Rob Patro et al, Nature Methods 2017) is an ultra-fast alignment-free methods, which also can correct for GC-bias.

Please run STAR+RSEM and Salmon on one good quality sample based on previous discussion to get FPKM and TPM. Identify the transcript and gene with the highest expression in this library from the Salmon output.

data: /mnt/data/data/HW1/raw_data2/
module: STAR_2.7.10a, RSEM v1.3.3, Salmon 1.5.1
index: /mnt/data/data/HW1/index/salmon_hg38_index,/mnt/data/data/HW1/index/rsem_hg38_index/hg38
genome: /mnt/data/data/HW1/genome/hg38/genes.gtf
# your bash code here

Problem 5: Speed Comparison

Report the relative speed of STAR+RSEM and Salmon for the analyses of the sample. Comment on your results based on the lecture material.

Problem 6:

Plot the relationship between effective length, normalized read counts, TPM, and FPKM for this sample from the RSEM and Salmon output. Comment on the relative utility of each metric when analyzing gene expression data.

# your R code here

Please knit your homework to HTML file, and rename is as ‘03_Zhangsan_Homework1.html’.

Please submit your homework1 to http://140.210.214.122 before 23:59 of Jul/24/2021.