We will give you a simple example to test high throughput sequencing alignment for RNA-seq data. Normally for paired-end sequencing data, each sample will have two separate FASTQ files, with line-by-line correspondence to the two reads from the same fragment. Read mapping could take a long time, so we have created just two FASTQ files of one RNA-seq sample with only 30,000 fragments (2 * 30,000 reads) for you to run STAR instead of the full data. The files are located on Huawei Cloud at /mnt/data/data/HW1/raw_data1. The mapping will generate one single output file.
Use STAR (Dobin et al, Bioinformatics 2012) to map the reads to the reference genome, available on Huawei Cloud at /mnt/data/data/HW1/index/star_hg38_index. Use the paired-end alignment mode and generate the output in SAM format. Please include full STAR report.How many reads are mappable and how many are uniquely mappable?
Sequencing data:
/mnt/data/data/HW1/raw_data1
module: STAR_2.7.9a
index: /mnt/data/data/HW1/index/star_hg38_index
# The raw data has copyed to ~/HW1/
# working path: ~/HW1/raw_data1/
/mnt/data/data/HW1/raw_data1/subA_l.fastq
STAR --runThreadN 8 --genomeDir /mnt/data/data/HW1/index/star_hg38_index/ --readFilesIn /mnt/data/data/HW1/raw_data1/subA_l.fastq /mnt/data/data/HW1/raw_data1/subA_r.fastq
Results:
Number of input reads | 29988
Average input read length | 96
UNIQUE READS:
Uniquely mapped reads number | 19293
Uniquely mapped reads % | 64.34%
Average mapped length | 95.61
Number of splices: Total | 2694
Number of splices: Annotated (sjdb) | 2656
Number of splices: GT/AG | 2667
Number of splices: GC/AG | 19
Number of splices: AT/AC | 2
Number of splices: Non-canonical | 6
Mismatch rate per base, % | 0.15%
Deletion rate per base | 0.01%
Deletion average length | 1.50
Insertion rate per base | 0.01%
Insertion average length | 1.28
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 8900
% of reads mapped to multiple loci | 29.68%
Number of reads mapped to too many loci | 79
% of reads mapped to too many loci | 0.26%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 1675
% of reads unmapped: too short | 5.59%
Number of reads unmapped: other | 41
% of reads unmapped: other | 0.14%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
You are asked by a collaborator to analyze four RNA-seq libraries. She suspects that the libraries are generally of high-quality but is concerned that a sample may have been switched with her benchmates during processing. To save time, we have provided four bam file generated by STAR in /mnt/data/data/HW1/raw_data2.
Please use RSeQC (Liguo Wang et al, Bioinformatics 2012) geneBody_coverage.py and tin.py modules to determine whether any of the samples exhibit unusual quality control metrics. To expedite the process, you can use housekeeping genes as reference, which provided in /mnt/data/data/HW1/raw_data2. Overall, identify the best and worst libraries. Your answer should include screen shots and tables as necessary as if you were delivering a report to the collaborator.
data:
/mnt/data/data/HW1/raw_data2/bamFile
module:RSeQC v4.0.0
# working path: ~/HW1/raw_data2/
geneBody_coverage.py -r ../hg38.HouseKeepingGenes.nochr.bed -i res_WAligned.sortedByCoord.out.bam,res_XAligned.sortedByCoord.out.bam,res_YAligned.sortedByCoord.out.bam,res_ZAligned.sortedByCoord.out.bam -o Res1_
tin.py -r ../hg38.HouseKeepingGenes.nochr.bed -i res_WAligned.sortedByCoord.out.bam,res_XAligned.sortedByCoord.out.bam,res_YAligned.sortedByCoord.out.bam,res_ZAligned.sortedByCoord.out.bam
| Bam_file | TIN.mean. | TIN.median. | TIN.stdev. |
|---|---|---|---|
| res_WAligned.sortedByCoord.out.bam | 45.74958 | 45.48953 | 21.33583 |
| res_XAligned.sortedByCoord.out.bam | 53.78790 | 55.81712 | 20.62771 |
| res_YAligned.sortedByCoord.out.bam | 56.20032 | 58.76576 | 19.80363 |
| res_ZAligned.sortedByCoord.out.bam | 24.55179 | 21.92499 | 14.20632 |
In this case, we can identify that Z is the worst sample.
One output of RSeQC is called geneBodyCoverage.r which contains normalized reads mapped to each % of gene / transcript body. Suppose that we want to visualize all 4 samples together to quickly perform quality control. Write a python program to extract the values and name from each file. The same script should then draw the gene body coverage for all the samples (2 rows x 2 cols) in one figure. We provide an example with 3 x 2 samples in one figure. Please identify the worst sample by your result.
install.packages('reticulate')
library('reticulate')
f = open('Res_.geneBodyCoverage.r')
y = []
name=['Z','W','X','Y']
for line in f:
if line.startswith('res_'):
s = line[36:-2]
yy = s.split(',')
yyy=[]
for e in yy:
yyy.append(float(e))
y.append(yyy)
else:
break
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0,100,100)
for i in range(len(y)):
plt.subplot((221+i))
plt.plot(x,y[i],'g-')
plt.title(name[i])
plt.xlabel("Gene body percentile (5'->3')")
plt.ylabel('Coverage')
plt.tight_layout()
with PdfPages('P3.pdf') as pp:
pp.savefig()
Transcript quantification plays an important role in the analysis of RNA-seq data. A large number of tools have been developed to quantify expression at the transcript level. RSEM (Bo Li et al, BMC Bioinformatics 2011) is a software package for estimating gene and isoform expression levels from single-end or paired-end RNA-Seq data, it can perform the alignment step with three different aligners: bowtie, bowtie2, or STAR. Salmon (Rob Patro et al, Nature Methods 2017) is an ultra-fast alignment-free methods, which also can correct for GC-bias.
Please run STAR+RSEM and Salmon on one good quality sample based on previous discussion to get FPKM and TPM. Identify the transcript and gene with the highest expression in this library from the Salmon output.
data: /mnt/data/data/HW1/raw_data2/
module: STAR_2.7.9a, RSEM v1.3.3, Salmon 1.5.1
index: /mnt/data/data/HW1/index/salmon_hg38_index,/mnt/data/data/HW1/index/rsem_hg38_index
# STAR+RSEM:
STAR --runThreadN 10 --genomeDir /mnt/data/data/HW1/index/star_hg38_index/ --readFilesCommand zcat --readFilesIn runW.fastq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix runW. --quantMode TranscriptomeSAM GeneCounts
rsem-calculate-expression --alignments -p 10 runW.Aligned.toTranscriptome.out.bam /mnt/data/data/HW1/index/rsem_hg38_index runW_rsem
#Salmon:
salmon quant -i /mnt/data/data/HW1/index/salmon_hg38_index/ -l A -r runW.fastq.gz -g /mnt/data/data/HW1/genome/hg38/genes.gtf -p 10 --validateMappings -o salmon
RSEM:
| gene_id | transcript_id.s. | length | effective_length | expected_count | TPM | FPKM | |
|---|---|---|---|---|---|---|---|
| 17235 | ENSG00000198938 | ENST00000362079 | 784 | 735 | 167826.73 | 114450.37 | 109228.84 |
| 17111 | ENSG00000198712 | ENST00000361739 | 684 | 635 | 50703.12 | 40022.51 | 38196.58 |
| 17138 | ENSG00000198763 | ENST00000361453 | 1042 | 993 | 56195.37 | 28365.78 | 27071.65 |
| 17157 | ENSG00000198804 | ENST00000361624 | 1542 | 1493 | 59225.35 | 19883.43 | 18976.29 |
| 17212 | ENSG00000198899 | ENST00000361899 | 681 | 632 | 25035.56 | 19855.62 | 18949.76 |
Salmon:
| Name | Length | EffectiveLength | TPM | NumReads | FPKM | |
|---|---|---|---|---|---|---|
| 5066 | ENSG00000228253 | 207.000 | 9.766 | 115399.0 | 3922.0 | 187856.05 |
| 11078 | ENSG00000198938 | 784.000 | 534.000 | 88014.5 | 163563.0 | 143277.55 |
| 2582 | ENSG00000212907 | 297.000 | 48.689 | 33663.4 | 5704.0 | 54800.31 |
| 11909 | ENSG00000198712 | 684.000 | 434.000 | 33398.8 | 50444.0 | 54369.35 |
| 4595 | ENSG00000234745 | 287.402 | 128.130 | 25325.7 | 11292.8 | 41227.32 |
Partial results is presented here. The results have been sorted, thus the first line of the tables show the highest expression gene.
Notice: The results of RSEM is not as same as the results of Salmon. Focus on the relationship between length and effectivelength, we found that the effectivelength is much more smaller in the first line of Salmon's results.
Report the relative speed of STAR+RSEM and Salmon for the analyses of the sample. Comment on your results based on the lecture material.
| method | time |
|---|---|
| STAR+RSEM | 42sec+154sec (STAR+RSEM) |
| Salmon | 61.435sec |
In this case, we found that Salmon is faster than STAR+RSEM.
>Each step was computered by 10 threads<
Since STAR mapping in the whole genome while Salmon only aligning with the known transcripts, and in the human genome, the coding sequence accounts for a very low proportion, Salmon will naturally be much faster.
Plot the relationship between effective length, normalized read counts, TPM, and FPKM for this sample from the RSEM and Salmon output. Comment on the relative utility of each metric when analyzing gene expression data.
rsem_data = read.table('runW_rsem.genes.results',header = TRUE)
salmon_data = read.table('quant.genes.sf',header = TRUE)
countToFpkm <- function(counts, effLen)
{
N <- sum(counts)
exp( log(counts) + log(1e9) - log(effLen) - log(N) )
}
salmon_data$FPKM <- countToFpkm(salmon_data$NumReads,salmon_data$EffectiveLength)
plot(rsem_data$TPM,rsem_data$FPKM,type = 'b', xlab = 'TPM',ylab = 'FPKM', main = 'TPM & FPKM (RSEM)')
plot(rsem_data$TPM,(rsem_data$expected_count/rsem_data$effective_length),type = 'b', xlab = 'TPM',ylab = 'Counts/Length', main = 'TPM & Counts/Length (RSEM)')
rsem_data <- na.omit(rsem_data)
sub_rsem <- rsem_data[which(rsem_data$length >800 & rsem_data$length <2000 & rsem_data$expected_count > 0),]
plot(sub_rsem$TPM,sub_rsem$expected_count,type = 'p', xlab = 'TPM',ylab = 'Counts', main = 'TPM & Counts (RSEM)')
ssub_rsem <- rsem_data[which(rsem_data$length >500 & rsem_data$length <15000),]
plot((ssub_rsem$TPM/ssub_rsem$expected_count),ssub_rsem$effective_length,type = 'p', xlab = 'TPM/Counts', ylab = 'Length', main = 'TPM/Counts & Length (RSEM)')
Here we use the results from RSEM, and in some cases, only a subset of the results is used for plotting purposes.
The TPM and FPKM are proportional.
The TPM and Counts/Length are proportional.
The TPM and Counts have a positive correlation.
The TPM/Counts and Length are inversely proportional.