NGS Week 11 Homework Assignment

Task 3: Align ChIP-seq reads to the human reference genome

Q3.1. Execute your script and report its contents for your answer [ 1 point ].

# 
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=8:00:00
#SBATCH --mem=8GB
#SBATCH --job-name=download_reads
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu

module load bwa/intel/0.7.17

bwa mem -M -t $SLURM_CPUS_PER_TASK /scratch/work/courses/BI7653/hw3.2022/hg38/Homo_sapiens.GRCh38.dna_sm.primary_assembly.normalized.fa SRR7207011_filtered.fq > SRR7207011.sam

bwa mem -M -t $SLURM_CPUS_PER_TASK /scratch/work/courses/BI7653/hw3.2022/hg38/Homo_sapiens.GRCh38.dna_sm.primary_assembly.normalized.fa SRR7207017_filtered.fq > SRR7207017.sam

bwa mem -M -t $SLURM_CPUS_PER_TASK /scratch/work/courses/BI7653/hw3.2022/hg38/Homo_sapiens.GRCh38.dna_sm.primary_assembly.normalized.fa SRR7207089_filtered.fq > SRR7207089.sam

Q3.2: When your alignments from Q3.1 have completed, create a script that will conduct the following 3 steps to prepare the BAMs for downstream analysis. The expected output is 3 coordinate-sorted BAM files and 3 BAM index files. (note: the simplest approach is to just run each of the required commands below three times, once for SAM). Obviously, you will need to use appropriate output filenames at each step. Execute your script and report its contents for your answer. [ 3 points ].

Step 1: Converting each SAM to BAM

#!/bin/bash
# 
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=8:00:00
#SBATCH --mem=8GB
#SBATCH --job-name=SAM_BAM
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu

module load samtools/intel/1.14

samtools view -b -h SRR7207011.sam > SRR7207011.bam

samtools view -b -h SRR7207017.sam > SRR7207017.bam

samtools view -b -h SRR7207089.sam > SRR7207089.bam

#Converting SAM to BAM

Step 2: Coordinate sort reads with Picard tools

#!/bin/bash 
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=8:00:00
#SBATCH --mem=46GB
#SBATCH --job-name=PICARD
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu

module load picard/2.23.8

java -Xmx44g -jar $PICARD_JAR SortSam \
INPUT=SRR7207011.bam \
OUTPUT=SRR7207011_picard.bam \
SORT_ORDER=coordinate \
TMP_DIR="${SLURM_JOBTMP}" \
MAX_RECORDS_IN_RAM=10000000 \
VALIDATION_STRINGENCY=LENIENT

java -Xmx44g -jar $PICARD_JAR SortSam \
INPUT=SRR7207017.bam \
OUTPUT=SRR7207017_picard.bam \
SORT_ORDER=coordinate \
TMP_DIR="${SLURM_JOBTMP}" \
MAX_RECORDS_IN_RAM=10000000 \
VALIDATION_STRINGENCY=LENIENT

java -Xmx44g -jar $PICARD_JAR SortSam \
INPUT=SRR7207089.bam \
OUTPUT=SRR7207089_picard.bam \
SORT_ORDER=coordinate \
TMP_DIR="${SLURM_JOBTMP}" \
MAX_RECORDS_IN_RAM=10000000 \
VALIDATION_STRINGENCY=LENIENT

Step 3: Create a BAM index file for each

#!/bin/bash
# 
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --time=8:00:00
#SBATCH --mem=46GB
#SBATCH --job-name=BAM_index
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu

module load samtools/intel/1.14

samtools index SRR7207011_picard.bam SRR7207011_coordinatesorted.out.index

samtools index SRR7207017_picard.bam SRR7207017_coordinatesorted.out.index

samtools index SRR7207089_picard.bam SRR7207089_coordinatesorted.out.index

Q3.3: Execute your commands separately on all three BAMs in a slurm script to produce new BAMs with low mapping quality reads removed and their index files. Report your script for your answer [ 1 point ].

Filters out reads with mapping quality less than 20 using the -q option:

#!/bin/bash
# 
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=8:00:00
#SBATCH --mem=8GB
#SBATCH --job-name=BAM_filter
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu

module load samtools/intel/1.14

samtools view -b -h -q 20 SRR7207011_picard.bam > SRR7207011_filteredmp.bam

samtools view -b -h -q 20 SRR7207017_picard.bam > SRR7207017_filteredmp.bam

samtools view -b -h -q 20 SRR7207089_picard.bam > SRR7207089_filteredmp.bam

Creating an index file for each of the filtered q20 BAMs

#!/bin/bash
# 
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --time=8:00:00
#SBATCH --mem=8GB
#SBATCH --job-name=BAM_index
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu

module load samtools/intel/1.14

samtools index SRR7207011_filteredmp.bam SRR7207011_coordinatesorted_filterd.out.index

samtools index SRR7207017_filteredmp.bam SRR7207017_coordinatesorted_filterd.out.index

samtools index SRR7207089_filteredmp.bam SRR7207089_coordinatesorted_filterd.out.index

Q4.1 Execute your script and report its contents for your answer. Retain your outputs for Week 12 assignment [ 1 point ].

#!/bin/bash
# 
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=8:00:00
#SBATCH --mem=8GB
#SBATCH --job-name=MACS2
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu

module load macs2/intel/2.2.7.1

macs2 callpeak -t SRR7207011_filteredmp.bam -c SRR7207089_filteredmp.bam -f BAM -g hs -n SRR7207011_filteredmp_macs2 -B -q 0.01

macs2 callpeak -t SRR7207017_filteredmp.bam -c SRR7207089_filteredmp.bam -f BAM -g hs -n SRR7207017_filteredmp_macs2 -B -q 0.01

Q4.2: Describe in your own words in three sentences or less the primary purpose of a transcription factor ChIP-seq experiment such as the one you are conducting this week and next [ 1 point ].

The main purpose of conducting a transcription factor ChIP-seq experiment is to identify, characterize and analyze regions or binding sites of the genome in which transcription factor binds to; this can also identify any DNA-binding proteins as well. Also, transcription factor ChIP-seq experiment can be utilized to identify the primary sequence of the DNA in which the transcription factor recognizes/binds to in the genome.

Q4.3: The human genome is highly repetitive. There are segmental duplications, low complexity sequence and transposable elements all of which may also harbor transcription factor binding sites. How reads map to these regions should always be a concern in NGS applications. In ChIP-seq it is common practice to remove multiply-mapped reads (‘multireads’), but this is not always advisable and in fact Chung et al. 2011 argue for not removing multiply-mapped reads. Review Chung et al. 2011 https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002111 and select all statements below that are true. (note: most answers can be found in the Abstratct and Introduction) [ 1 point ].

#The statements below are correct/true:a,b,c and e

#a. the primary problem with multiply mapped reads is that you don't know where in the genome the reads actually map
#b. discarding multiply mapped reads could introduce false negatives (failure to detect true peaks)
#c. many multiply mapped reads map to segmental duplication regions
#e.including multiply-mapped reads in moderate to highly mappable regions can improve peak identification

NGS Week 11 Homework Assignment

Bonnie Lin

Task 1: Download reads using the sra toolkit.

Q1.1: Log in to Greene and create a slurm script that downloads the data. The script should load the most recent sra-tools module then download the three fastq file. Execute your script and report the contents and the names of the fastq files in your /scratch. [ 1 point ].

Task 2: Run fastp to process the reads

Task 3: Align ChIP-seq reads to the human reference genome

Q3.1. Execute your script and report its contents for your answer [ 1 point ].

Q3.3: Execute your commands separately on all three BAMs in a slurm script to produce new BAMs with low mapping quality reads removed and their index files. Report your script for your answer [ 1 point ].

Q4.1 Execute your script and report its contents for your answer. Retain your outputs for Week 12 assignment [ 1 point ].

Q4.2: Describe in your own words in three sentences or less the primary purpose of a transcription factor ChIP-seq experiment such as the one you are conducting this week and next [ 1 point ].