Array job script:
#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=8:00:00
#SBATCH --mem=16GB
#SBATCH --job-name=RNAseq_align
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu
#SBATCH --array=1-8
module load samtools/intel/1.14
module load star/intel/2.7.6a
table=/scratch/work/courses/BI7653/hw8.2022/fastqs.txt
line="$(head -n $SLURM_ARRAY_TASK_ID $table | tail -n 1)"
sample="$(printf "%s" "${line}" | cut -f1)"
fq1="$(printf "%s" "${line}" | cut -f2)"
fq2="$(printf "%s" "${line}" | cut -f3)"
mkdir $sample
cd $sample
STAR --runThreadN $SLURM_CPUS_PER_TASK \
--genomeDir /scratch/work/courses/BI7653/hw8.2022/STAR.genome \
--readFilesIn /scratch/work/courses/BI7653/hw8.2022/fastqs/${sample}_1PE.fastq.gz /scratch/work/courses/BI7653/hw8.2022/fastqs/${sample}_2PE.fastq.gz \
--outFileNamePrefix $sample \
--outSAMtype BAM SortedByCoordinate \
--readFilesCommand zcat \
--limitBAMsortRAM 20000000000 \
--outTmpDir ${SLURM_TMPDIR}/${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
samtools index ${sample}Aligned.sortedByCoord.out.bam
16762189_8 cm RNAseq_a bl2477 R 0:13 1 cm003
16762189_7 cm RNAseq_a bl2477 R 0:13 1 cm002
16762189_6 cm RNAseq_a bl2477 R 0:13 1 cm001
16762189_5 cm RNAseq_a bl2477 R 0:13 1 cm029
16762189_4 cm RNAseq_a bl2477 R 0:13 1 cm029
16762189_3 cm RNAseq_a bl2477 R 0:13 1 cm029
16762189_2 cm RNAseq_a bl2477 R 0:13 1 cm028
16762189_1 cm RNAseq_a bl2477 R 0:13 1 cm028
#jobID on Greene
1. The number of uniquely mapped reads= 5091972
2. The percentage of uniquely mapped reads=14.89%
3. The total number of input reads= 34202682
By default, the mapping quality used by STAR to indicate uniquely mapped reads is mapping quality of 255.
To make sure STAR output only uniquely mapped reads, I can control the output by using –outFilterMultimapNmax 1 which should output reads that only maps to one locus. By default, N is set to 10, meaning that any read that maps to less or equal to 10 loci will be output. However, if I set the parameter of –outFilterMultimapNmax to 1 then STAR should only output uniquely mapped reads (reads that mapped to only one locus).
The number and percentage of reads mapped to too many loci is very high for PDAC253 sample which may be an indication of contamination from rRNA which is the result of insufficient depletion of rRNA during library preparation. rRNA contains many paralogs that are highly sequence-similar; therefore, if rRNA is not removed appropriately during library prep, the rRNAs can be mapped to several locations (multiple loci) and hence the high percentage of multiply-mapped reads from the rRNA contamination. Normally during library prep, rRNAs are depleted by different techniques but if the technique is not done correctly, it will result in insufficient rRNA depletion and thus high percentage of multipl-mapped reads which came from rRNAs mapping to several locations as most rRNAs contain many paralogs that are very similar in sequence. To evaluate if this high number/percentage of reads mapped to many loci is due to insufficient rRNA depletion, I would extract some of the sequences that are identified as mapping to many different loci and BLAST search these sequences to the rRNA database.
module load samtools/intel/1.14
samtools view -H PDAC253Aligned.sortedByCoord.out.bam | head -n 20
# samtools view command to view the first 20 lines of the header for PDAC253 outout BAM
#First 20 lines of the header for PDAC253 output BAM:
@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:40814151
@SQ SN:chr2 LN:29301675
@SQ SN:chr3 LN:24755689
@SQ SN:chr4 LN:33281721
@SQ SN:chr5 LN:18619412
@SQ SN:chr6 LN:18596258
@SQ SN:chr7 LN:16639383
@SQ SN:chr8 LN:31698078
@SQ SN:chr9 LN:22757669
@SQ SN:chr10 LN:15825318
@SQ SN:chr11 LN:29487722
@SQ SN:chr12 LN:14769854
@SQ SN:chr13 LN:12891333
@SQ SN:chr14 LN:24628924
@SQ SN:chr15 LN:12030914
@SQ SN:chr16 LN:13553361
@SQ SN:chr17 LN:16126437
@SQ SN:chr18 LN:9812533
@SQ SN:000007F LN:4728343
Yes, the bam file is coordinate-sorted based on the header SO:coordinate which indicate that.
samtools view -h PDAC253Aligned.sortedByCoord.out.bam > PDAC253Aligned.sortedByCoord.out.sam
#converting BAM file to SAM file using samtools
Some of the mapping quality scores that are present in the alignment for PDAC253 are 255, 0, 1 and 3 (at least for the first couple of alignments)
Mapping quality of 255 indicates uniquely mapped reads while mapping quality of 3 indicates that the read maps to 2 locations. Mapping quality of 1 means the read maps to 4-9 locations while mapping quality of 0 means the read mapes to 10 (default) or more locations. There is also a mapping quality of 2 that did not seem to be present in PDAC253 but mapping quality of 2 means that the read maps to 3 locations. STAR uses a very similar mapping quality score scheme to Tophat except for uniquely mapped reads, STAR uses the default 255.
Yes, this is possible by counting reads in the analysis step using tools such as htseq-count which can generate a matrix of counts per gene. In this case, if the pair of gene that is duplicated is not entirely identical, then we can the duplicated gene as two separate genes and generate the matrix of counts or reads mapped to each form of the duplicated gene. We can then conduct a statistical analysis of the matrix of counts generated by htseq-count and determine the differential gene expression (e.g. differential gene expression of each form of duplicated gene overall differential expression of the gene by combining them). If the duplicated genes are exactly the same in terms of the sequence, then we need to make sure that when we ned to take into account multiply-mapped reads since there would be two locations of this gene (duplicated gene) and would be identified as multiply-mapped reads when using STAR. We can then use htseq-count again to generate matrix of counts for the reads that is mapped to this gene and then determine the differential gene expression.