Bioinformatics 2021 HW

Problem 1: STAR alignment

We will give you a simple example to test high throughput sequencing alignment for RNA-seq data. Normally for paired-end sequencing data, each sample will have two separate FASTQ files, with line-by-line correspondence to the two reads from the same fragment. Read mapping could take a long time, so we have created just two FASTQ files of one RNA-seq sample with only 30,000 fragments (2 * 30,000 reads) for you to run STAR instead of the full data. The files are located on Huawei Cloud at /mnt/data/data/HW1/raw_data1. The mapping will generate one single output file.

Use STAR (Dobin et al, Bioinformatics 2012) to map the reads to the reference genome, available on Huawei Cloud at /mnt/data/data/HW1/index/star_hg38_index. Use the paired-end alignment mode and generate the output in SAM format. Please include full STAR report.How many reads are mappable and how many are uniquely mappable?

Sequencing data:
/mnt/data/data/HW1/raw_data1

module: STAR_2.7.9a
index: /mnt/data/data/HW1/index/star_hg38_index

# The raw data has copyed to ~/HW1/
# working path: ~/HW1/raw_data1/

/mnt/data/data/HW1/raw_data1/subA_l.fastq

STAR --runThreadN 8 --genomeDir /mnt/data/data/HW1/index/star_hg38_index/ --readFilesIn /mnt/data/data/HW1/raw_data1/subA_l.fastq /mnt/data/data/HW1/raw_data1/subA_r.fastq

 Results:
                          Number of input reads |   29988
                      Average input read length |   96
                                    UNIQUE READS:
                   Uniquely mapped reads number |   19293
                        Uniquely mapped reads % |   64.34%
                          Average mapped length |   95.61
                       Number of splices: Total |   2694
            Number of splices: Annotated (sjdb) |   2656
                       Number of splices: GT/AG |   2667
                       Number of splices: GC/AG |   19
                       Number of splices: AT/AC |   2
               Number of splices: Non-canonical |   6
                      Mismatch rate per base, % |   0.15%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.50
                        Insertion rate per base |   0.01%
                       Insertion average length |   1.28
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   8900
             % of reads mapped to multiple loci |   29.68%
        Number of reads mapped to too many loci |   79
             % of reads mapped to too many loci |   0.26%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   1675
                 % of reads unmapped: too short |   5.59%
                Number of reads unmapped: other |   41
                     % of reads unmapped: other |   0.14%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

Problem 2: RNA-seq quality control

You are asked by a collaborator to analyze four RNA-seq libraries. She suspects that the libraries are generally of high-quality but is concerned that a sample may have been switched with her benchmates during processing. To save time, we have provided four bam file generated by STAR in /mnt/data/data/HW1/raw_data2.

Please use RSeQC (Liguo Wang et al, Bioinformatics 2012) geneBody_coverage.py and tin.py modules to determine whether any of the samples exhibit unusual quality control metrics. To expedite the process, you can use housekeeping genes as reference, which provided in /mnt/data/data/HW1/raw_data2. Overall, identify the best and worst libraries. Your answer should include screen shots and tables as necessary as if you were delivering a report to the collaborator.

data:
/mnt/data/data/HW1/raw_data2/bamFile
module:RSeQC v4.0.0

# working path: ~/HW1/raw_data2/

geneBody_coverage.py -r ../hg38.HouseKeepingGenes.nochr.bed -i res_WAligned.sortedByCoord.out.bam,res_XAligned.sortedByCoord.out.bam,res_YAligned.sortedByCoord.out.bam,res_ZAligned.sortedByCoord.out.bam -o Res1_

tin.py -r ../hg38.HouseKeepingGenes.nochr.bed -i res_WAligned.sortedByCoord.out.bam,res_XAligned.sortedByCoord.out.bam,res_YAligned.sortedByCoord.out.bam,res_ZAligned.sortedByCoord.out.bam

Bam_file	TIN.mean.	TIN.median.	TIN.stdev.
res_WAligned.sortedByCoord.out.bam	45.74958	45.48953	21.33583
res_XAligned.sortedByCoord.out.bam	53.78790	55.81712	20.62771
res_YAligned.sortedByCoord.out.bam	56.20032	58.76576	19.80363
res_ZAligned.sortedByCoord.out.bam	24.55179	21.92499	14.20632

In this case, we can identify that Z is the worst sample.

Problem 3: Python programming

One output of RSeQC is called geneBodyCoverage.r which contains normalized reads mapped to each % of gene / transcript body. Suppose that we want to visualize all 4 samples together to quickly perform quality control. Write a python program to extract the values and name from each file. The same script should then draw the gene body coverage for all the samples (2 rows x 2 cols) in one figure. We provide an example with 3 x 2 samples in one figure. Please identify the worst sample by your result.

install.packages('reticulate')
library('reticulate')

f = open('Res_.geneBodyCoverage.r')

y = []
name=['Z','W','X','Y']
for line in f:
    if line.startswith('res_'):
        s = line[36:-2]
        yy = s.split(',')
        yyy=[]
        for e in yy:
            yyy.append(float(e))
        y.append(yyy)
    else:
        break

from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0,100,100)

for i in range(len(y)):
    plt.subplot((221+i))
    plt.plot(x,y[i],'g-')
    plt.title(name[i])
    plt.xlabel("Gene body percentile (5'->3')")
    plt.ylabel('Coverage')
    plt.tight_layout()

with PdfPages('P3.pdf') as pp:
    pp.savefig()

Problem 4: RNA-seq quantification

Transcript quantification plays an important role in the analysis of RNA-seq data. A large number of tools have been developed to quantify expression at the transcript level. RSEM (Bo Li et al, BMC Bioinformatics 2011) is a software package for estimating gene and isoform expression levels from single-end or paired-end RNA-Seq data, it can perform the alignment step with three different aligners: bowtie, bowtie2, or STAR. Salmon (Rob Patro et al, Nature Methods 2017) is an ultra-fast alignment-free methods, which also can correct for GC-bias.

Please run STAR+RSEM and Salmon on one good quality sample based on previous discussion to get FPKM and TPM. Identify the transcript and gene with the highest expression in this library from the Salmon output.

data: /mnt/data/data/HW1/raw_data2/
module: STAR_2.7.9a, RSEM v1.3.3, Salmon 1.5.1
index: /mnt/data/data/HW1/index/salmon_hg38_index,/mnt/data/data/HW1/index/rsem_hg38_index

# STAR+RSEM:
STAR --runThreadN 10 --genomeDir /mnt/data/data/HW1/index/star_hg38_index/ --readFilesCommand zcat --readFilesIn runW.fastq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix runW. --quantMode TranscriptomeSAM GeneCounts

rsem-calculate-expression --alignments -p 10 runW.Aligned.toTranscriptome.out.bam /mnt/data/data/HW1/index/rsem_hg38_index runW_rsem


#Salmon:
salmon quant -i /mnt/data/data/HW1/index/salmon_hg38_index/ -l A -r runW.fastq.gz -g /mnt/data/data/HW1/genome/hg38/genes.gtf -p 10 --validateMappings -o salmon

RSEM:

	gene_id	transcript_id.s.	length	effective_length	expected_count	TPM	FPKM
17235	ENSG00000198938	ENST00000362079	784	735	167826.73	114450.37	109228.84
17111	ENSG00000198712	ENST00000361739	684	635	50703.12	40022.51	38196.58
17138	ENSG00000198763	ENST00000361453	1042	993	56195.37	28365.78	27071.65
17157	ENSG00000198804	ENST00000361624	1542	1493	59225.35	19883.43	18976.29
17212	ENSG00000198899	ENST00000361899	681	632	25035.56	19855.62	18949.76

Salmon:

	Name	Length	EffectiveLength	TPM	NumReads	FPKM
5066	ENSG00000228253	207.000	9.766	115399.0	3922.0	187856.05
11078	ENSG00000198938	784.000	534.000	88014.5	163563.0	143277.55
2582	ENSG00000212907	297.000	48.689	33663.4	5704.0	54800.31
11909	ENSG00000198712	684.000	434.000	33398.8	50444.0	54369.35
4595	ENSG00000234745	287.402	128.130	25325.7	11292.8	41227.32

Partial results is presented here. The results have been sorted, thus the first line of the tables show the highest expression gene.

Notice: The results of RSEM is not as same as the results of Salmon. Focus on the relationship between length and effectivelength, we found that the effectivelength is much more smaller in the first line of Salmon's results.

Problem 5: Speed Comparison

Report the relative speed of STAR+RSEM and Salmon for the analyses of the sample. Comment on your results based on the lecture material.

method	time
STAR+RSEM	42sec+154sec (STAR+RSEM)
Salmon	61.435sec

In this case, we found that Salmon is faster than STAR+RSEM.
>Each step was computered by 10 threads<

Since STAR mapping in the whole genome while Salmon only aligning with the known transcripts, and in the human genome, the coding sequence accounts for a very low proportion, Salmon will naturally be much faster.

Problem 6:

Plot the relationship between effective length, normalized read counts, TPM, and FPKM for this sample from the RSEM and Salmon output. Comment on the relative utility of each metric when analyzing gene expression data.

rsem_data = read.table('runW_rsem.genes.results',header = TRUE)
salmon_data = read.table('quant.genes.sf',header = TRUE)


countToFpkm <- function(counts, effLen)
{
  N <- sum(counts)
  exp( log(counts) + log(1e9) - log(effLen) - log(N) )
}

salmon_data$FPKM <- countToFpkm(salmon_data$NumReads,salmon_data$EffectiveLength)


plot(rsem_data$TPM,rsem_data$FPKM,type = 'b', xlab = 'TPM',ylab = 'FPKM', main = 'TPM & FPKM (RSEM)')

plot(rsem_data$TPM,(rsem_data$expected_count/rsem_data$effective_length),type = 'b', xlab = 'TPM',ylab = 'Counts/Length', main = 'TPM & Counts/Length (RSEM)')

rsem_data <- na.omit(rsem_data)
sub_rsem <- rsem_data[which(rsem_data$length >800 & rsem_data$length <2000 & rsem_data$expected_count > 0),]
plot(sub_rsem$TPM,sub_rsem$expected_count,type = 'p', xlab = 'TPM',ylab = 'Counts', main = 'TPM & Counts (RSEM)')

ssub_rsem <- rsem_data[which(rsem_data$length >500 & rsem_data$length <15000),]
plot((ssub_rsem$TPM/ssub_rsem$expected_count),ssub_rsem$effective_length,type = 'p', xlab = 'TPM/Counts', ylab = 'Length', main = 'TPM/Counts & Length (RSEM)')

Here we use the results from RSEM, and in some cases, only a subset of the results is used for plotting purposes.

The TPM and FPKM are proportional.
The TPM and Counts/Length are proportional.
The TPM and Counts have a positive correlation.
The TPM/Counts and Length are inversely proportional.

Bioinformatics 2021 HW_1