TagSeq utilities (https://github.com/Eli-Meyer/TagSeq_utilities) were used for this analysis. The scripts were installed in the local machine following the author’s instruction.
If you have docker installed, I have created a docker image containing all the required software to analyse TagSeq data. To run it just do:
docker pull fmobegi/tagseq_utils:1.0.0-beta
docker run \
--user $(id -u):$(id -g) \
-v $(pwd):/work_dir \
fmobegi/tagseq_utils:1.0.0-beta \
SCRIPT_TO_RUN \
[options]
For example:
docker run \
--user $(id -u):$(id -g) \
-v $(pwd):/work_dir \
fmobegi/tagseq_utils:1.0.0-beta \
SAMFilterByGene.pl \
-i sample.sam \
-o sample.filtered.sam
Running each script with --help provides the full array of input parameters.
TRY:
docker run \
--user $(id -u):$(id -g) \
-v $(pwd):/work_dir \
fmobegi/tagseq_utils:1.0.0-beta \
SAMFilterByGene.pl --help
NB: If you do not have sudo privileges on the system where you’re running the scripts from, try pulling the image as a singularity container and run normally without root rights
To run the scripts, first prepare the input fastq.
The scripts provided cannot process gzipped files. 01. First, copy files to working directory (cp SA*/*.gz /data/RNAseqDir/ && cd /data/RNAseqDir/), rename samples (rename 's/_R1_001//g' *.gz) and, unzip them (parallel --gnu gunzip ::: *gz)
First, reads were filtered based on quality score Q \(\geq\) 20 and LQ \(\leq\) 10. Reads that passed this step were then depleted of homo-polymer repeats and adapter sequences.
After removing PCR duplicates and trimming the TAGS, reads were mapped to various reference genomes using STARv.2.7.3a, BBMapv.38.92, and BWAv.0.7.17-r1188. The alignments in BAM format were sorted and indexed using SAMtoolsv.1.10 running htslibv.1.10.2-3. Transcripts were counted using HTSeq-countv.0.11.2 and StringTiev.2.1.1.
The quality of reads was assessed before and after QC using FastQCv.0.11.9 and reports compiled using MultiQCv.1.11.
NB: Removal of PCR duplicates led to massive loss in reads from ~3M-4M, to 0.5M-1.25M reads. This in turn affected the final transcripts count. We opted to skip this optional step while re-analyzing the data. Trimming is also not necessary when using local aligners.
These steps have been stitched together in a dummy Nexflow workflow here.
The script can be executed like:
nextflow run -resume quantify_Tagseq.nf \
--transcriptome "*.fastq" \
--cores 14 \
--adaptors 'adaptors.fa' \
--outdir 'results_TagSeq'
## Aligners
#...............................................
bwa mem \
-t 8 \
'${params.reference}' fqreads.af.fq.gz | \
samtools sort \
-@8 \
-o '${sampleID}.bam' -
samtools index \
'${sampleID}'.bam \
'${sampleID}'.bam.bai
STAR \
--runThreadN 14 \
--runMode genomeGenerate \
--genomeDir '${sampleID}' \
--genomeFastaFiles '${params.reference}' \
--sjdbGTFfile '${params.gtf}' \
--sjdbOverhang 99
STAR \
--runThreadN 14
--genomeDir '${params.reference}' \
--readFilesIn '${sampleID}'.tt.fq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--outFileNamePrefix '${sampleID}' \
--sjdbGTFtagExonParentTranscript Parent
hisat2 \
-q \
--time \
--summary-file '${sampleID}'.sf.txt \
--met-file '${sampleID}'.mf.txt \
--threads 8 -x '${params.reference}' \
-U '${sampleID}'.tt.fq.gz \
--dta-cufflinks | \
tee \
>(samtools flagstat - > test.flagstat) | \
samtools sort \
-O BAM | \
tee \
'${sampleID}.bam' | \
samtools index - \
'${sampleID}'.bam.bai
gmapper \
--qv-offset 33 \
-Q \
--strata \
-o 3 \
-N 1 \
'${sampleID}'.tt.fq \
'${params.reference}' \
> '${sampleID}'.bwa.bam
## Counters
#.....................................................
stringtie \
-e \
-v \
-p 14 \
-e \
-o '${sampleID}'.gtf \
-G '${params.gff3}' \
-A '${sampleID}'.ga \
-l '${sampleID}' \
'${sampleID}'.bam
SAMFilterByGene.pl \
-i '${sampleID}'.bwa.bam \
-m 40 \
-o '${sampleID}'.filtered.sam
htseq-count \
--format=bam \
--stranded=no \
--type=CDS \
--order=pos \
--idattr=Name \
'${sampleID}'.bwa.bam \
'${params.gff3}' > \
'${sampleID}'.ht_seq.tsv
## Combinng HTSeq and SAMFilterByGene.pl count tab files
CombineExpression.pl all_counts_tab > abundance.matrix.tsv
## Combining stingtie files.
for i in *.ga; do
awk -F"\t" '{if (NR!=1) {print $1, $8}}' OFS='\t' $i > ${i/.ga/.fpkm};
awk -F"\t" '{if (NR!=1) {print $1, $9}}' OFS='\t' $i > ${i/.ga/.tpm};
echo -e ${i/.ga/}"\t"./${i/.ga/.gtf} > gtf_files;
prepDE.py -i gtf_files -g ${i/.ga/.raw.pre};
done
ls *.pre | xargs -I {} -exec bash -c 'grep -v gene_id "{}" | tr \, \\t > "{}".try'
rename 's/.raw.pre.try/.raw/g' *.try
create-gem.py --sources . --prefix reference_preifx --type raw
create-gem.py --sources . --prefix reference_preifx --type FPKM
create-gem.py --sources . --prefix reference_preifx --type TPM
Unfortunately, only stringtie provides normalised counts in form of FPKM and TPM in addition to RAW counts. The other two methods need postprocessing in R to obtain normalised reads.
These files are suitable for differential gene expression analysis using R Bioconductor packages DESeq2, edgeR, and Limma-Voom.
………………. THE END ………………….
Author: Fredrick M. Mobegi, PhD
Created: 18-08-2021 Wed 07:30h
Copyright © 2020 Fredrick Mobegi | This notebook is for reference purposes only and may contain links to embargoed or legally privileged data.