Task 1: Align RNA-seq reads to the reference genome

Q1.1: Report the contents of your array job script and the job id on Greene [ 3 points ].

Array job script:

#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=8:00:00
#SBATCH --mem=16GB
#SBATCH --job-name=RNAseq_align
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu
#SBATCH --array=1-8

module load samtools/intel/1.14
module load star/intel/2.7.6a

table=/scratch/work/courses/BI7653/hw8.2022/fastqs.txt
  
line="$(head -n $SLURM_ARRAY_TASK_ID $table | tail -n 1)"
sample="$(printf "%s" "${line}" | cut -f1)"
fq1="$(printf "%s" "${line}" | cut -f2)"
fq2="$(printf "%s" "${line}" | cut -f3)"

mkdir $sample
cd $sample



STAR --runThreadN $SLURM_CPUS_PER_TASK \
     --genomeDir /scratch/work/courses/BI7653/hw8.2022/STAR.genome \
     --readFilesIn /scratch/work/courses/BI7653/hw8.2022/fastqs/${sample}_1PE.fastq.gz /scratch/work/courses/BI7653/hw8.2022/fastqs/${sample}_2PE.fastq.gz \
     --outFileNamePrefix $sample \
     --outSAMtype BAM SortedByCoordinate \
     --readFilesCommand zcat \
     --limitBAMsortRAM 20000000000 \
     --outTmpDir ${SLURM_TMPDIR}/${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}

samtools index ${sample}Aligned.sortedByCoord.out.bam
        16762189_8        cm RNAseq_a   bl2477  R       0:13      1 cm003
        16762189_7        cm RNAseq_a   bl2477  R       0:13      1 cm002
        16762189_6        cm RNAseq_a   bl2477  R       0:13      1 cm001
        16762189_5        cm RNAseq_a   bl2477  R       0:13      1 cm029
        16762189_4        cm RNAseq_a   bl2477  R       0:13      1 cm029
        16762189_3        cm RNAseq_a   bl2477  R       0:13      1 cm029
        16762189_2        cm RNAseq_a   bl2477  R       0:13      1 cm028
        16762189_1        cm RNAseq_a   bl2477  R       0:13      1 cm028
#jobID on Greene

Q1.2: Review the file “Log.final.out” for sample PDAC253 and report the following [ 1 point ]:

1. The number of uniquely mapped reads= 5091972
2. The percentage of uniquely mapped reads=14.89%
3. The total number of input reads= 34202682

Q1.3a: What is the mapping quality used by STAR by default to indicate uniquely mapped reads?

By default, the mapping quality used by STAR to indicate uniquely mapped reads is mapping quality of 255.

Q1.3b:If you want to make sure STAR output only uniquely mapped reads, how might you do this (see documentation)?

To make sure STAR output only uniquely mapped reads, I can control the output by using –outFilterMultimapNmax 1 which should output reads that only maps to one locus. By default, N is set to 10, meaning that any read that maps to less or equal to 10 loci will be output. However, if I set the parameter of –outFilterMultimapNmax to 1 then STAR should only output uniquely mapped reads (reads that mapped to only one locus).

Q1.4: The number and percentage of reads mapped to too many loci is very high for PDAC253 sample. Provide a hypothesis for this observation and how you might go about evaluating it. [ 1 point ]

The number and percentage of reads mapped to too many loci is very high for PDAC253 sample which may be an indication of contamination from rRNA which is the result of insufficient depletion of rRNA during library preparation. rRNA contains many paralogs that are highly sequence-similar; therefore, if rRNA is not removed appropriately during library prep, the rRNAs can be mapped to several locations (multiple loci) and hence the high percentage of multiply-mapped reads from the rRNA contamination. Normally during library prep, rRNAs are depleted by different techniques but if the technique is not done correctly, it will result in insufficient rRNA depletion and thus high percentage of multipl-mapped reads which came from rRNAs mapping to several locations as most rRNAs contain many paralogs that are very similar in sequence. To evaluate if this high number/percentage of reads mapped to many loci is due to insufficient rRNA depletion, I would extract some of the sequences that are identified as mapping to many different loci and BLAST search these sequences to the rRNA database.

Q1.5: Report the first 20 lines of the header for one output BAM (using samtools view). Then answer is your BAM coordinate-sorted? Please include your samtools view command in your answer for full credit. [ 1 point ]?


module load samtools/intel/1.14
samtools view -H PDAC253Aligned.sortedByCoord.out.bam | head -n 20
# samtools view command to view the first 20 lines of the header for PDAC253 outout BAM

#First 20 lines of the header for PDAC253 output BAM:
@HD VN:1.4  SO:coordinate
@SQ SN:chr1 LN:40814151
@SQ SN:chr2 LN:29301675
@SQ SN:chr3 LN:24755689
@SQ SN:chr4 LN:33281721
@SQ SN:chr5 LN:18619412
@SQ SN:chr6 LN:18596258
@SQ SN:chr7 LN:16639383
@SQ SN:chr8 LN:31698078
@SQ SN:chr9 LN:22757669
@SQ SN:chr10    LN:15825318
@SQ SN:chr11    LN:29487722
@SQ SN:chr12    LN:14769854
@SQ SN:chr13    LN:12891333
@SQ SN:chr14    LN:24628924
@SQ SN:chr15    LN:12030914
@SQ SN:chr16    LN:13553361
@SQ SN:chr17    LN:16126437
@SQ SN:chr18    LN:9812533
@SQ SN:000007F  LN:4728343

Yes, the bam file is coordinate-sorted based on the header SO:coordinate which indicate that.

Q1.6a: What mapping quality scores are present in the alignment for PDAC253 (note: you may need to convert BAM to SAM)? Include either your command line or how you answered this question in your answer [1 point]

samtools view -h PDAC253Aligned.sortedByCoord.out.bam > PDAC253Aligned.sortedByCoord.out.sam
#converting BAM file to SAM file using samtools 

Some of the mapping quality scores that are present in the alignment for PDAC253 are 255, 0, 1 and 3 (at least for the first couple of alignments)

Q1.6b: What does each of the observed mapping quality mean? (hint see STAR documentation) [ 1 point ]

Mapping quality of 255 indicates uniquely mapped reads while mapping quality of 3 indicates that the read maps to 2 locations. Mapping quality of 1 means the read maps to 4-9 locations while mapping quality of 0 means the read mapes to 10 (default) or more locations. There is also a mapping quality of 2 that did not seem to be present in PDAC253 but mapping quality of 2 means that the read maps to 3 locations. STAR uses a very similar mapping quality score scheme to Tophat except for uniquely mapped reads, STAR uses the default 255.

Q1.7: Imagine that you are working on a pair of recently duplicated genes and want to independently test for differential gene expression for the duplicated genes with the RNA-seq data in this assignment. Do you think this is possible? What factor(s) should be considered in order to do so? [ 1 point ]

Yes, this is possible by counting reads in the analysis step using tools such as htseq-count which can generate a matrix of counts per gene. In this case, if the pair of gene that is duplicated is not entirely identical, then we can the duplicated gene as two separate genes and generate the matrix of counts or reads mapped to each form of the duplicated gene. We can then conduct a statistical analysis of the matrix of counts generated by htseq-count and determine the differential gene expression (e.g. differential gene expression of each form of duplicated gene overall differential expression of the gene by combining them). If the duplicated genes are exactly the same in terms of the sequence, then we need to make sure that when we ned to take into account multiply-mapped reads since there would be two locations of this gene (duplicated gene) and would be identified as multiply-mapped reads when using STAR. We can then use htseq-count again to generate matrix of counts for the reads that is mapped to this gene and then determine the differential gene expression.