1 The tools

TagSeq utilities (https://github.com/Eli-Meyer/TagSeq_utilities) were used for this analysis. The scripts were installed on the local machine following the author’s instructions.
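A minimal sketch of such a local install, assuming the scripts only require Perl and simply need to be executable and on your PATH (follow the repository README for anything beyond this):

# clone the scripts, make them executable, and expose them on the PATH
git clone https://github.com/Eli-Meyer/TagSeq_utilities.git
chmod +x TagSeq_utilities/*.pl
export PATH="$PATH:$(pwd)/TagSeq_utilities"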

If you have Docker installed, I have created a Docker image containing all the software required to analyse TagSeq data. To run it, just do:

docker pull fmobegi/tagseq_utils:1.0.0-beta
docker run \
    --user $(id -u):$(id -g) \
    -v $(pwd):/work_dir \
    fmobegi/tagseq_utils:1.0.0-beta \
    SCRIPT_TO_RUN \
    [options]
  

For example:

docker run \
    --user $(id -u):$(id -g) \
    -v $(pwd):/work_dir \
    fmobegi/tagseq_utils:1.0.0-beta \
    SAMFilterByGene.pl \
    -i sample.sam \
    -o sample.filtered.sam

Running each script with --help provides the full array of input parameters.

TRY:

docker run \
    --user $(id -u):$(id -g) \
    -v $(pwd):/work_dir \
    fmobegi/tagseq_utils:1.0.0-beta \
    SAMFilterByGene.pl --help
    

NB: If you do not have sudo privileges on the system where you’re running the scripts, pull the image as a Singularity container instead and run it as usual without root rights.
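For example, with a recent Singularity/Apptainer installation (the .sif filename shown is the default Singularity generates from the tag and may differ on your system):

# pull the same Docker image as a Singularity image
singularity pull docker://fmobegi/tagseq_utils:1.0.0-beta

# run a script from the container, binding the current directory as the work directory
singularity exec \
    --bind $(pwd):/work_dir \
    tagseq_utils_1.0.0-beta.sif \
    SAMFilterByGene.pl --help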

The scripts provided cannot process gzipped files, so first prepare the input FASTQ files: copy them to the working directory, strip the read suffix from the sample names, and unzip them.

cp SA*/*.gz /data/RNAseqDir/ && cd /data/RNAseqDir/
rename 's/_R1_001//g' *.gz
parallel --gnu gunzip ::: *.gz

2 Preprocessing and quality control of raw 3′ Tag-RNA-Seq reads

2.1 Quality filtering and removal of adapters

First, reads were filtered to retain those with quality scores Q ≥ 20 and no more than 10 low-quality bases per read (LQ ≤ 10). Reads that passed this step were then depleted of homopolymer repeats and adapter sequences.
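As a rough sketch, this step can be run through the Docker image described above. The script names QualFilterFastq.pl and HRFilterFastq.pl and the -i/-m/-x/-o options are assumptions about the TagSeq_utilities interface; confirm the exact names and options with each script’s --help before use.

# quality filter (assumed: -m minimum quality, -x maximum low-quality bases)
docker run \
    --user $(id -u):$(id -g) \
    -v $(pwd):/work_dir \
    fmobegi/tagseq_utils:1.0.0-beta \
    QualFilterFastq.pl -i sample.fastq -m 20 -x 10 -o sample.qf.fastq

# homopolymer-repeat filter (assumed script name and options)
docker run \
    --user $(id -u):$(id -g) \
    -v $(pwd):/work_dir \
    fmobegi/tagseq_utils:1.0.0-beta \
    HRFilterFastq.pl -i sample.qf.fastq -o sample.qf.hr.fastq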

2.2 Alignment to the reference genomes

After removing PCR duplicates and trimming the tags, reads were mapped to various reference genomes using STAR v2.7.3a, BBMap v38.92, and BWA v0.7.17-r1188. The alignments, in BAM format, were sorted and indexed using SAMtools v1.10 (htslib v1.10.2-3). Transcripts were counted using HTSeq-count v0.11.2 and StringTie v2.1.1.

Read quality was assessed before and after QC using FastQC v0.11.9, and the reports were compiled using MultiQC v1.11.
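A minimal sketch of this QC step (directory names are illustrative):

# run FastQC on all FASTQ files, then aggregate the reports with MultiQC
mkdir -p fastqc_raw
fastqc -t 8 -o fastqc_raw/ *.fastq
multiqc -o multiqc_report/ fastqc_raw/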

NB: Removal of PCR duplicates led to a massive loss of reads, from ~3-4 M down to 0.5-1.25 M, which in turn affected the final transcript counts. We opted to skip this optional step when re-analysing the data. Trimming is also not necessary when using local aligners.

3 The workflow

These steps have been stitched together in a dummy Nextflow workflow here.

The script can be executed as follows:

nextflow run -resume quantify_Tagseq.nf \
    --transcriptome "*.fastq" \
    --cores 14 \
    --adaptors 'adaptors.fa' \
    --outdir 'results_TagSeq'

4 Assembly parameters used for different aligners and transcript counters


## Aligners 
#...............................................

# BWA-MEM: map reads to the reference and coordinate-sort the output with samtools
bwa mem \
    -t 8 \
    '${params.reference}' fqreads.af.fq.gz | \
    samtools sort \
        -@8 \
        -o '${sampleID}.bam' -


# Index the sorted BAM
samtools index \
    '${sampleID}'.bam \
    '${sampleID}'.bam.bai

    
# STAR: build the genome index (genomeGenerate mode)
STAR \
    --runThreadN 14 \
    --runMode genomeGenerate \
    --genomeDir '${sampleID}' \
    --genomeFastaFiles '${params.reference}' \
    --sjdbGTFfile '${params.gtf}' \
    --sjdbOverhang 99


# STAR: map reads against the index, producing a coordinate-sorted BAM plus per-gene counts
STAR \
    --runThreadN 14 \
    --genomeDir '${params.reference}' \
    --readFilesIn '${sampleID}'.tt.fq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --quantMode GeneCounts \
    --outFileNamePrefix '${sampleID}' \
    --sjdbGTFtagExonParentTranscript Parent


# HISAT2: map reads, capture mapping statistics, then sort and index the BAM in one pipeline
hisat2 \
    -q \
    --time \
    --summary-file '${sampleID}'.sf.txt \
    --met-file '${sampleID}'.mf.txt \
    --threads 8 -x '${params.reference}' \
    -U '${sampleID}'.tt.fq.gz \
    --dta-cufflinks | \
    tee \
        >(samtools flagstat - > test.flagstat) | \
    samtools sort \
        -O BAM | \
    tee \
        '${sampleID}'.bam | \
    samtools index - \
        '${sampleID}'.bam.bai


# SHRiMP (gmapper): map reads, keeping only top-strata hits
gmapper \
    --qv-offset 33 \
    -Q \
    --strata \
    -o 3 \
    -N 1 \
    '${sampleID}'.tt.fq \
    '${params.reference}' \
    > '${sampleID}'.bwa.bam
    
## Counters
#.....................................................
# StringTie: estimate abundances for reference transcripts only (-e) and write gene abundances (-A)
stringtie \
    -e \
    -v \
    -p 14 \
    -o '${sampleID}'.gtf \
    -G '${params.gff3}' \
    -A '${sampleID}'.ga \
    -l '${sampleID}' \
    '${sampleID}'.bam


# SAMFilterByGene.pl: filter the gmapper alignments by gene match (-m sets the match threshold)
SAMFilterByGene.pl \
    -i '${sampleID}'.bwa.bam \
    -m 40 \
    -o '${sampleID}'.filtered.sam


# htseq-count: count reads per CDS feature, using the Name attribute as the feature identifier
htseq-count \
    --format=bam \
    --stranded=no \
    --type=CDS \
    --order=pos \
    --idattr=Name \
    '${sampleID}'.bwa.bam \
    '${params.gff3}' > \
    '${sampleID}'.ht_seq.tsv

5 Post-processing of stringtie/htseq-count metrics

## Combining HTSeq and SAMFilterByGene.pl count tab files

CombineExpression.pl all_counts_tab > abundance.matrix.tsv

## Combining stringtie files
for i in *.ga; do 
    # Column 8 of the StringTie gene-abundance (.ga) table is FPKM; column 9 is TPM
    awk -F"\t" '{if (NR!=1) {print $1, $8}}' OFS='\t' $i > ${i/.ga/.fpkm}; 
    awk -F"\t" '{if (NR!=1) {print $1, $9}}' OFS='\t' $i > ${i/.ga/.tpm}; 
    # Write a one-line sample<TAB>GTF list for this sample and extract raw counts with prepDE.py
    echo -e ${i/.ga/}"\t"./${i/.ga/.gtf} > gtf_files; 
    prepDE.py -i gtf_files -g ${i/.ga/.raw.pre};
done

# Convert the comma-separated prepDE.py outputs to tab-separated count tables
ls *.pre | xargs -I {} bash -c 'grep -v gene_id "{}" | tr \, \\t > "{}".try'
rename 's/.raw.pre.try/.raw/g' *.try
create-gem.py --sources . --prefix reference_prefix --type raw
create-gem.py --sources . --prefix reference_prefix --type FPKM
create-gem.py --sources . --prefix reference_prefix --type TPM

Unfortunately, only StringTie provides normalised counts (FPKM and TPM) in addition to raw counts; the counts from the other two methods need post-processing in R to obtain normalised values.

These files are suitable for differential gene expression analysis using the R/Bioconductor packages DESeq2, edgeR, and limma-voom.

………………. THE END ………………….

Author: Fredrick M. Mobegi, PhD
Created: 18-08-2021 Wed 07:30h
Copyright © 2020 Fredrick Mobegi | This notebook is for reference purposes only and may contain links to embargoed or legally privileged data.