STAR (Spliced Transcripts Alignment to a Reference)
Default parameters?
$rcloneDir/rclone lsf --config="$rcloneDir/nygc_sample.conf" nygc_sample:<sample_name>
samtools view -b -h -T /hpc/hers_en/ywang/Reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa $datadir/cram/<sample> > <sample>.bam
wget ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.chr.gtf.gz
gunzip Homo_sapiens.GRCh38.99.chr.gtf.gz
CRIES code for GTF annotation files. Release 99
This code outputs locations of exons and introns that are supported by both Ensembl and Havana.
# This is the source annotation file
# Note that the annotation file should be in Ensembl format (downloaded from Ensembl FTP)
# Specifically, the chromosome names should lack the "chr" prefix
# Also, the first tag in column 9 must be gene_id, and the third tag must be transcript_id
# Also, the source annotation file must have a transcript_source tage in column 9
input="Homo_sapiens.GRCh38.99.chr.gtf"
# 1. Identify transcripts that are supported by both Ensembl and Havana annotations
# 2. Identify and write constitutive exons, i.e. those that appear in all Ensembl/Havana isoforms of a gene
output="./Homo_sapiens.GRCh38.99.chr.consExons.gtf"
cat $input | grep 'transcript_source "ensembl_havana"' | awk -v FS='\t' '$3=="exon" { exonName=$1":"$4":"$5":"$7; split($9, fields, ";"); geneName=fields[1]; transcriptName=fields[3]; printf("%s\t%s\t%s\n",exonName,geneName,transcriptName); }' | sort | uniq | awk -v FS='\t' '{ eCount[$1]++; tCount[$3]++; exonHost[$1]=$2; if(tCount[$3]==1) gCount[$2]++; } END { for(i in eCount) if(eCount[i]==gCount[exonHost[i]]) { split(i,fields,":"); printf("%s\tensembl_havana\texon\t%s\t%s\t.\t%s\t.\t%s;\n",fields[1],fields[2],fields[3],fields[4],exonHost[i]); } }' > $output
# 1. Identify transcripts that are supported by both Ensembl and Havana annotations
# 2. Find all exons
# 3. Write the coordinate of intronic regions, i.e. the regions that separate two adjacent exons of the same gene
output="./Homo_sapiens.GRCh38.99.chr.Introns.gtf"
cat $input | grep 'transcript_source "ensembl_havana"' | awk -v FS='\t' '$3=="exon" { exonName=$1":"$4":"$5":"$7; split($9, fields, ";"); geneName=fields[1]; transcriptName=fields[3]; printf("%s\t%s\t%s\n",exonName,geneName,transcriptName); }' | sort | uniq | awk -v FS='\t' '{ eCount[$1]++; tCount[$3]++; exonHost[$1]=$2; if(tCount[$3]==1) gCount[$2]++; } END { for(i in eCount) { split(i,fields,":"); printf("%s\tensembl_havana\texon\t%s\t%s\t.\t%s\t.\t%s;\n",fields[1],fields[2],fields[3],fields[4],exonHost[i]); } }' | bedtools sort -i stdin | awk -v FS='\t' '{ if( last_exon[$9]==1 && (last_exon_end[$9]+1)<($4-1) ) printf("%s\t%s\tintron\t%i\t%i\t%s\t%s\t%s\t%s\n",$1,$2,last_exon_end[$9]+1,$4-1,$6,$7,$8,$9); last_exon[$9]=1; last_exon_end[$9]=$5; }' > $output
In CRIES Step 2 they only include reads with a mapping quality > 30. Include -a 30 in HTSeq count? -a corresponds to alignment quality. Default is 10.
We use HTSeq-count for counting reads. For counting exonic reads, we run the HTSeq-count using the “intersection-strict” mode, to ensure that the reads that are counted entirely fit within the mature mRNA sequence, and do not overlap alternatively spliced exons. For intronic reads, we run the HTSeq-count using the “union” mode, since any read that even partially overlaps an intronic region likely originates from pre-mature RNA.
htseq-count -m intersection-strict -f bam -t exon -s <strand> <sampleName>.srt.filtered.bam ./Homo_sapiens.GRCh38.87.chr.consExons.gtf > <sampleName>".consExons.counts.txt"
htseq-count -m union -f bam -t intron -s <strand> <sampleName>.srt.filtered.bam ./Homo_sapiens.GRCh38.87.chr.Introns.gtf > <sampleName>".Introns.counts.txt"
The parameter
depends on the strandedness of the library preparation kit. For Illumina TruSeq RNA Library Prep Kit, the correct parameter is often reverse. If not sure, try all the three different options no, yes and reverse, and decide accordingly. Note that for paired-end reads, -r name must be used along with BAM files that are sorted by read name.
Code in main Rmd
Stringency 0.99 -> 2019 genes in the ‘test’ analysis on 15 samples
biasMode Linear: Currently, only available.
cd /hpc/hers_en/wdenooijer/rnadyn/scripts/REMBRANDTS/
conda activate rembrandts
module load R
sh ./REMBRANDTS.sh test /hpc/hers_en/wdenooijer/rnadyn/data/metadata.tab /hpc/hers_en/wdenooijer/rnadyn/data/countsdata 0.99 linear