Problem 1: STAR alignment

We will give you a simple example to test high throughput sequencing alignment for RNA-seq data. Normally for paired-end sequencing data, each sample will have two separate FASTQ files, with line-by-line correspondence to the two reads from the same fragment. Read mapping could take a long time, so we have created just two FASTQ files of one RNA-seq sample with only 30,000 fragments (2 * 30,000 reads) for you to run STAR instead of the full data. The files are located on Huawei Cloud at /mnt/data/data/HW1/raw_data1. The mapping will generate one single output file.

Use STAR (Dobin et al, Bioinformatics 2012) to map the reads to the reference genome, available on Huawei Cloud at /mnt/data/data/HW1/index/star_hg38_index. Use the paired-end alignment mode and generate the output in SAM format. Please include full STAR report.How many reads are mappable and how many are uniquely mappable?

Sequencing data:
/mnt/data/data/HW1/raw_data1

module: STAR_2.7.9a
index: /mnt/data/data/HW1/index/star_hg38_index
# The raw data has copyed to ~/HW1/
# working path: ~/HW1/raw_data1/

/mnt/data/data/HW1/raw_data1/subA_l.fastq

STAR --runThreadN 8 --genomeDir /mnt/data/data/HW1/index/star_hg38_index/ --readFilesIn /mnt/data/data/HW1/raw_data1/subA_l.fastq /mnt/data/data/HW1/raw_data1/subA_r.fastq
 Results:
                          Number of input reads |   29988
                      Average input read length |   96
                                    UNIQUE READS:
                   Uniquely mapped reads number |   19293
                        Uniquely mapped reads % |   64.34%
                          Average mapped length |   95.61
                       Number of splices: Total |   2694
            Number of splices: Annotated (sjdb) |   2656
                       Number of splices: GT/AG |   2667
                       Number of splices: GC/AG |   19
                       Number of splices: AT/AC |   2
               Number of splices: Non-canonical |   6
                      Mismatch rate per base, % |   0.15%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.50
                        Insertion rate per base |   0.01%
                       Insertion average length |   1.28
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   8900
             % of reads mapped to multiple loci |   29.68%
        Number of reads mapped to too many loci |   79
             % of reads mapped to too many loci |   0.26%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   1675
                 % of reads unmapped: too short |   5.59%
                Number of reads unmapped: other |   41
                     % of reads unmapped: other |   0.14%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

Problem 2: RNA-seq quality control

You are asked by a collaborator to analyze four RNA-seq libraries. She suspects that the libraries are generally of high-quality but is concerned that a sample may have been switched with her benchmates during processing. To save time, we have provided four bam file generated by STAR in /mnt/data/data/HW1/raw_data2.

Please use RSeQC (Liguo Wang et al, Bioinformatics 2012) geneBody_coverage.py and tin.py modules to determine whether any of the samples exhibit unusual quality control metrics. To expedite the process, you can use housekeeping genes as reference, which provided in /mnt/data/data/HW1/raw_data2. Overall, identify the best and worst libraries. Your answer should include screen shots and tables as necessary as if you were delivering a report to the collaborator.

data:
/mnt/data/data/HW1/raw_data2/bamFile
module:RSeQC v4.0.0
# working path: ~/HW1/raw_data2/

geneBody_coverage.py -r ../hg38.HouseKeepingGenes.nochr.bed -i res_WAligned.sortedByCoord.out.bam,res_XAligned.sortedByCoord.out.bam,res_YAligned.sortedByCoord.out.bam,res_ZAligned.sortedByCoord.out.bam -o Res1_

tin.py -r ../hg38.HouseKeepingGenes.nochr.bed -i res_WAligned.sortedByCoord.out.bam,res_XAligned.sortedByCoord.out.bam,res_YAligned.sortedByCoord.out.bam,res_ZAligned.sortedByCoord.out.bam
Bam_file TIN.mean. TIN.median. TIN.stdev.
res_WAligned.sortedByCoord.out.bam 45.74958 45.48953 21.33583
res_XAligned.sortedByCoord.out.bam 53.78790 55.81712 20.62771
res_YAligned.sortedByCoord.out.bam 56.20032 58.76576 19.80363
res_ZAligned.sortedByCoord.out.bam 24.55179 21.92499 14.20632

In this case, we can identify that Z is the worst sample.

Problem 3: Python programming

One output of RSeQC is called geneBodyCoverage.r which contains normalized reads mapped to each % of gene / transcript body. Suppose that we want to visualize all 4 samples together to quickly perform quality control. Write a python program to extract the values and name from each file. The same script should then draw the gene body coverage for all the samples (2 rows x 2 cols) in one figure. We provide an example with 3 x 2 samples in one figure. Please identify the worst sample by your result.

install.packages('reticulate')
library('reticulate')

f = open('Res_.geneBodyCoverage.r')

y = []
name=['Z','W','X','Y']
for line in f:
    if line.startswith('res_'):
        s = line[36:-2]
        yy = s.split(',')
        yyy=[]
        for e in yy:
            yyy.append(float(e))
        y.append(yyy)
    else:
        break

from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0,100,100)

for i in range(len(y)):
    plt.subplot((221+i))
    plt.plot(x,y[i],'g-')
    plt.title(name[i])
    plt.xlabel("Gene body percentile (5'->3')")
    plt.ylabel('Coverage')
    plt.tight_layout()

with PdfPages('P3.pdf') as pp:
    pp.savefig()

Problem 4: RNA-seq quantification

Transcript quantification plays an important role in the analysis of RNA-seq data. A large number of tools have been developed to quantify expression at the transcript level. RSEM (Bo Li et al, BMC Bioinformatics 2011) is a software package for estimating gene and isoform expression levels from single-end or paired-end RNA-Seq data, it can perform the alignment step with three different aligners: bowtie, bowtie2, or STAR. Salmon (Rob Patro et al, Nature Methods 2017) is an ultra-fast alignment-free methods, which also can correct for GC-bias.

Please run STAR+RSEM and Salmon on one good quality sample based on previous discussion to get FPKM and TPM. Identify the transcript and gene with the highest expression in this library from the Salmon output.

data: /mnt/data/data/HW1/raw_data2/
module: STAR_2.7.9a, RSEM v1.3.3, Salmon 1.5.1
index: /mnt/data/data/HW1/index/salmon_hg38_index,/mnt/data/data/HW1/index/rsem_hg38_index
# STAR+RSEM:
STAR --runThreadN 10 --genomeDir /mnt/data/data/HW1/index/star_hg38_index/ --readFilesCommand zcat --readFilesIn runW.fastq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix runW. --quantMode TranscriptomeSAM GeneCounts

rsem-calculate-expression --alignments -p 10 runW.Aligned.toTranscriptome.out.bam /mnt/data/data/HW1/index/rsem_hg38_index runW_rsem


#Salmon:
salmon quant -i /mnt/data/data/HW1/index/salmon_hg38_index/ -l A -r runW.fastq.gz -g /mnt/data/data/HW1/genome/hg38/genes.gtf -p 10 --validateMappings -o salmon

RSEM:

gene_id transcript_id.s. length effective_length expected_count TPM FPKM
17235 ENSG00000198938 ENST00000362079 784 735 167826.73 114450.37 109228.84
17111 ENSG00000198712 ENST00000361739 684 635 50703.12 40022.51 38196.58
17138 ENSG00000198763 ENST00000361453 1042 993 56195.37 28365.78 27071.65
17157 ENSG00000198804 ENST00000361624 1542 1493 59225.35 19883.43 18976.29
17212 ENSG00000198899 ENST00000361899 681 632 25035.56 19855.62 18949.76

Salmon:

Name Length EffectiveLength TPM NumReads FPKM
5066 ENSG00000228253 207.000 9.766 115399.0 3922.0 187856.05
11078 ENSG00000198938 784.000 534.000 88014.5 163563.0 143277.55
2582 ENSG00000212907 297.000 48.689 33663.4 5704.0 54800.31
11909 ENSG00000198712 684.000 434.000 33398.8 50444.0 54369.35
4595 ENSG00000234745 287.402 128.130 25325.7 11292.8 41227.32
Partial results is presented here. The results have been sorted, thus the first line of the tables show the highest expression gene.

Notice: The results of RSEM is not as same as the results of Salmon. Focus on the relationship between length and effectivelength, we found that the effectivelength is much more smaller in the first line of Salmon's results.

Problem 5: Speed Comparison

Report the relative speed of STAR+RSEM and Salmon for the analyses of the sample. Comment on your results based on the lecture material.

method time
STAR+RSEM 42sec+154sec (STAR+RSEM)
Salmon 61.435sec
In this case, we found that Salmon is faster than STAR+RSEM.
>Each step was computered by 10 threads<

Since STAR mapping in the whole genome while Salmon only aligning with the known transcripts, and in the human genome, the coding sequence accounts for a very low proportion, Salmon will naturally be much faster.

Problem 6:

Plot the relationship between effective length, normalized read counts, TPM, and FPKM for this sample from the RSEM and Salmon output. Comment on the relative utility of each metric when analyzing gene expression data.

rsem_data = read.table('runW_rsem.genes.results',header = TRUE)
salmon_data = read.table('quant.genes.sf',header = TRUE)


countToFpkm <- function(counts, effLen)
{
  N <- sum(counts)
  exp( log(counts) + log(1e9) - log(effLen) - log(N) )
}

salmon_data$FPKM <- countToFpkm(salmon_data$NumReads,salmon_data$EffectiveLength)


plot(rsem_data$TPM,rsem_data$FPKM,type = 'b', xlab = 'TPM',ylab = 'FPKM', main = 'TPM & FPKM (RSEM)')

plot(rsem_data$TPM,(rsem_data$expected_count/rsem_data$effective_length),type = 'b', xlab = 'TPM',ylab = 'Counts/Length', main = 'TPM & Counts/Length (RSEM)')

rsem_data <- na.omit(rsem_data)
sub_rsem <- rsem_data[which(rsem_data$length >800 & rsem_data$length <2000 & rsem_data$expected_count > 0),]
plot(sub_rsem$TPM,sub_rsem$expected_count,type = 'p', xlab = 'TPM',ylab = 'Counts', main = 'TPM & Counts (RSEM)')

ssub_rsem <- rsem_data[which(rsem_data$length >500 & rsem_data$length <15000),]
plot((ssub_rsem$TPM/ssub_rsem$expected_count),ssub_rsem$effective_length,type = 'p', xlab = 'TPM/Counts', ylab = 'Length', main = 'TPM/Counts & Length (RSEM)')

Here we use the results from RSEM, and in some cases, only a subset of the results is used for plotting purposes.

The TPM and FPKM are proportional.
The TPM and Counts/Length are proportional.
The TPM and Counts have a positive correlation.
The TPM/Counts and Length are inversely proportional.