Task 1: Pre-processing fastq data for variant discovery

Q1.1a Paste the contents of your job script into your homework file. Please use an RMarkdown code blocl or quivalent (see Week 2 webinar when we will discuss RMarkdown) if possible.

#!/bin/bash
# 
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=8:00:00
#SBATCH --mem=4GB
#SBATCH --job-name=slurm_template
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=bl2477@nyu.edu

module load fastp/intel/0.20.1

fastp -i /scratch/work/courses/BI7653/hw2.2022/ERR156634_1.filt.fastq.gz -I /scratch/work/courses/BI7653/hw2.2022/ERR156634_2.filt.fastq.gz -o fastpR1.out.fq.gz -O fastpR2.out.fq.gz -l 76 -n 50 --detect_adapter_for_pe

Q1.1b Report your job id

Job Id: 1480736

Q1.1c Check the exit status of your job usig the seff command. What is the exit code and what does it mean?

The exit code is 0 which means that the command that was initiated was ran successfully without encountering any errors.

Q1.2 A common source of confusion for students new to the command prompt is the distinction between relative and absolute file paths.

Q1.2a What is the difference?

The difference between a relative and absolute file path is that an absolute path is defined as specifying a file or directory from the root directory as defined by “/”. The absolute path is what is defined as the complete path from the root directory that always start with / directory. In comparison, the relative path is the path that is related with the current working directory that you are working in. For example if I were to access the ngs.week2 directory and I am currently in the /scratch/bl2477 directory:

Absolute path concept:

cd /scratch/bl2477/ngs.week2

Relative path concept:

cd ngs.week2

With the absolute path concept, I would access the ngs.week2 directory by starting at the root directory (/) and making my way to the directory that I want to access. In the relative path concept, I can directory access the ngs.week2 starting from the current directory I am in instead of starting from the root directory.

Q1.2b Did you use an absolute or relative path to read the fastq files in the course directory? Did you use an absolute or relative path to write the processed fastq files?

To read the fastq files or access the fastq files, I used an absolute path which starts with the root directory (/). I used the relative path to write the processed fastq files for the output files to be directly writen to the current working directory. For example fastpR1.out.fq.gz and fastpR2.out.fq.gz output files were directory written to the current directory using the relative path without describing where the current directory is and without starting from the root directory (/).

Q1.3 Confirm your job has completed and that you are working interactively at a Greene compute node. You may then answer the following questions using a combination of gunzip and the STDOUT of fastp. The fastp STDOUT contains logging information. If you did not specifically re-direct STDOUT to a file when running fastp then the STDOUT will be located in the file with the STDOUT for the job. Locate that file and then answer the following [ 1 point ].

Q1.3a What is the name of the STDOUT file for your job?

The name of the STDOUT file for my job is slurm-14807360.out

Q1.3b Provide an example command line you would use to redirect the STDOUT of fastp to a file name of your choosing instead of the default STDOUT file for the job? If you are working in RMarkdown or equivalent please add a code block with your answer.

#!/bin/bash
# 
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=8:00:00
#SBATCH --mem=4GB
#SBATCH --job-name=slurm_template
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=bl2477@nyu.edu
#SBATCH -o stdout.txt

module load fastp/intel/0.20.1

fastp -i /scratch/work/courses/BI7653/hw2.2022/ERR156634_1.filt.fastq.gz -I /scratch/work/courses/BI7653/hw2.2022/ERR156634_2.filt.fastq.gz -o fastpR1.out.fq.gz -O fastpR2.out.fq.gz -l 76 -n 50 --detect_adapter_for_pe

#The SBATCH -o stdout.txt command in the job script will redirect the STDOUT to a file named stdout.txt

Q1.3c What percentage of the bases were Phred quality of Q30 or above in each of the original and processed fastqs?

Before processing:

Q30 bases (Read 1): 1654818604(94.3776%)
Q30 bases (Read 2): 1604981035(91.5352%)

After processing:

Q30 bases (Read 1): 1628465308(94.7822%)
Q30 bases (Read 2): 1588178177(92.4373%)

Q1.3d Report the “Filtering result” section of the output and the duplication rate. Notice the different filters that are applied by fastp.

Filtering result:
reads passed filter: 34367822
reads failed due to low quality: 541724
reads failed due to too many N: 0
reads failed due to too short: 158504
reads with adapter trimmed: 201414
bases trimmed due to adapters: 9134238

Duplication rate: 0.768038%

Q1.4 fastp produces an .html report fastp.html by default. Please download it and identify something interesting or unexpected to you and upload along with your homework document [ 1 point ].

It’s interesting to me that the fastp report generated contained KMER counting. One other interesting point is the base content plot, specifically for read1. Before filtering, the base content ratio for each base according to position graph showed oscillating lines. However, after filtering, the lines for base content ratios plotted against position were much smoother.

Task 2: Process fastqs and generate fastqc reports

Q2.1 Each array index will have its own STDERR and STDOUT which by default are written to a single file with naming convention slurm-.out. Please review the contents of the output for index “1” and answer the following [ 1 point ]:

Q2.1a Were any adapter sequences detected?

No adapter sequences were detected

Q2.1b How many reads were in the read 1 set before filtering? Read 2?

Before filtering, read 1 has a total of 58521629 reads while read 2 has a total of 58521629 reads

Q2.1c How many reads survived filtering in Read 1 set? Read 2?

55124556 reads survived filtering in both Read 1 and Read 2 set

Q2.1d What percentage of reads survived filtering in Read 1 set?

About 94.1952% of reads survived filtering in Read 1 set

Q2.1e Copy the contents of your output for index 1 of your job array to your homework file

Processing array index: 1 sample: NA18757
Detecting adapter sequence for read1...
No adapter detected for read1

Detecting adapter sequence for read2...
No adapter detected for read2

Read1 before filtering:
total reads: 58521629
total bases: 5910684529
Q20 bases: 5700891751(96.4506%)
Q30 bases: 5376153278(90.9565%)

Read2 before filtering:
total reads: 58521629
total bases: 5910684529
Q20 bases: 5587494577(94.5321%)
Q30 bases: 5236863942(88.6%)

Read1 after filtering:
total reads: 55124556
total bases: 5566101957
Q20 bases: 5450378785(97.9209%)

Read2 aftering filtering:
total reads: 55124556
total bases: 5566101957
Q20 bases: 5410523196(97.2049%)
Q30 bases: 5092424228(91.49%)

Filtering result:
reads passed filter: 110249112
reads failed due to low quality: 6081064
reads failed due to too many N: 0
reads failed due to too short: 713082
reads with adapter trimmed: 949358
bases trimmed due to adapters: 43433322

Duplication rate: 1.01555%

Insert size peak (evaluated by paired-end reads): 171

JSON report: NA18757.fastp.json
HTML report: NA18757.fastp.html

fastp -i /scratch/work/courses/BI7653/hw2.2022/SRR708363_1.filt.fastq.gz -I /scratch/work/courses/BI7653/hw2.2022/SRR708363_2.filt.fastq.gz -o SRR708363_1.filt.fP.fastq.gz -O SRR708363_2.filt.rP.fastq.gz --length_required 76 --detect_adapter_for_pe --n_base_limit 50 --html NA18757.fastp.html --json NA18757.fastp.json
fastp v0.20.1, time used: 1813 seconds
_ESTATUS_ [ fastp for NA18757 ]: 0
Started analysis of SRR708363_1.filt.fP.fastq.gz
Approx 5% complete for SRR708363_1.filt.fP.fastq.gz
Approx 10% complete for SRR708363_1.filt.fP.fastq.gz
Approx 15% complete for SRR708363_1.filt.fP.fastq.gz
Approx 20% complete for SRR708363_1.filt.fP.fastq.gz
Approx 25% complete for SRR708363_1.filt.fP.fastq.gz
Approx 30% complete for SRR708363_1.filt.fP.fastq.gz
Approx 35% complete for SRR708363_1.filt.fP.fastq.gz
Approx 40% complete for SRR708363_1.filt.fP.fastq.gz
Approx 45% complete for SRR708363_1.filt.fP.fastq.gz
Approx 50% complete for SRR708363_1.filt.fP.fastq.gz
Approx 55% complete for SRR708363_1.filt.fP.fastq.gz
Approx 60% complete for SRR708363_1.filt.fP.fastq.gz
Approx 65% complete for SRR708363_1.filt.fP.fastq.gz
Approx 70% complete for SRR708363_1.filt.fP.fastq.gz
Approx 75% complete for SRR708363_1.filt.fP.fastq.gz
Approx 80% complete for SRR708363_1.filt.fP.fastq.gz
Approx 85% complete for SRR708363_2.filt.rP.fastq.gz
Approx 90% complete for SRR708363_2.filt.rP.fastq.gz
Approx 95% complete for SRR708363_2.filt.rP.fastq.gz
Analysis complete for SRR708363_2.filt.rP.fastq.gz
_ESTATUS_ [ fastqc for NA18757 ]: 0
_END_ [ fastp for NA18757 ]: Thu Feb 10 19:17:52 EST 2022

Q2.2 Confirming quickly that each command line executed successfully can be challenging. Your instructor used echo to print the word ESTATUS and the BASH special variable $? (which reports the exit status of the most recently executed command) after the fastp and fastqc commands to confirm a zero exit status. You can quickly check that all processes completed with a zero exit status by navigating to your Task 2 directory (where the slurm-_.out files are located) and enter: grep ESTATUS slurm*out. Did all commands have an exit status of zero? Please copy the result of the grep command into your homework document [ 1 point ]

Yes, all commands have an exit status of zero as shown below:

slurm-14835460_10.out:_ESTATUS_ [ fastp for HG00149 ]: 0
slurm-14835460_10.out:_ESTATUS_ [ fastqc for HG00149 ]: 0
slurm-14835460_11.out:_ESTATUS_ [ fastp for HG00260 ]: 0
slurm-14835460_11.out:_ESTATUS_ [ fastqc for HG00260 ]: 0
slurm-14835460_12.out:_ESTATUS_ [ fastp for NA18907 ]: 0
slurm-14835460_12.out:_ESTATUS_ [ fastqc for NA18907 ]: 0
slurm-14835460_13.out:_ESTATUS_ [ fastp for NA19137 ]: 0
slurm-14835460_14.out:_ESTATUS_ [ fastp for NA19093 ]: 0
slurm-14835460_15.out:_ESTATUS_ [ fastp for NA19256 ]: 0
slurm-14835460_15.out:_ESTATUS_ [ fastqc for NA19256 ]: 0
slurm-14835460_16.out:_ESTATUS_ [ fastp for NA19098 ]: 0
slurm-14835460_16.out:_ESTATUS_ [ fastqc for NA19098 ]: 0
slurm-14835460_17.out:_ESTATUS_ [ fastp for NA18870 ]: 0
slurm-14835460_18.out:_ESTATUS_ [ fastp for NA18909 ]: 0
slurm-14835460_18.out:_ESTATUS_ [ fastqc for NA18909 ]: 0
slurm-14835460_19.out:_ESTATUS_ [ fastp for NA19138 ]: 0
slurm-14835460_1.out:_ESTATUS_ [ fastp for NA18757 ]: 0
slurm-14835460_1.out:_ESTATUS_ [ fastqc for NA18757 ]: 0
slurm-14835460_20.out:_ESTATUS_ [ fastp for HG00151 ]: 0
slurm-14835460_20.out:_ESTATUS_ [ fastqc for HG00151 ]: 0
slurm-14835460_21.out:_ESTATUS_ [ fastp for HG00106 ]: 0
slurm-14835460_21.out:_ESTATUS_ [ fastqc for HG00106 ]: 0
slurm-14835460_2.out:_ESTATUS_ [ fastp for NA18627 ]: 0
slurm-14835460_2.out:_ESTATUS_ [ fastqc for NA18627 ]: 0
slurm-14835460_3.out:_ESTATUS_ [ fastp for NA18591 ]: 0
slurm-14835460_3.out:_ESTATUS_ [ fastqc for NA18591 ]: 0
slurm-14835460_4.out:_ESTATUS_ [ fastp for NA18566 ]: 0
slurm-14835460_4.out:_ESTATUS_ [ fastqc for NA18566 ]: 0
slurm-14835460_5.out:_ESTATUS_ [ fastp for NA18644 ]: 0
slurm-14835460_5.out:_ESTATUS_ [ fastqc for NA18644 ]: 0
slurm-14835460_6.out:_ESTATUS_ [ fastp for NA18545 ]: 0
slurm-14835460_6.out:_ESTATUS_ [ fastqc for NA18545 ]: 0
slurm-14835460_7.out:_ESTATUS_ [ fastp for HG00113 ]: 0
slurm-14835460_7.out:_ESTATUS_ [ fastqc for HG00113 ]: 0
slurm-14835460_8.out:_ESTATUS_ [ fastp for HG00243 ]: 0
slurm-14835460_8.out:_ESTATUS_ [ fastqc for HG00243 ]: 0
slurm-14835460_9.out:_ESTATUS_ [ fastp for HG00132 ]: 0

Task 3: Use MultiQC to generate a multi-sample QC report

Q3.1 Report your multiqc command

cd /scratch/bl2477/ngs.week2/task2
find $PWD -name \*fastqc.zip > fastqc_files.txt
less fastqc_files.txt
cd /scratch/bl2477/ngs.week2/task3
module load multiqc/1.9
multiqc --file-list /scratch/bl2477/ngs.week2/task2/fastqc_files.txt

Q3.2 Which fastq file has the greatest decline in base quality with increasing sequencing cycle (“the dephasing problem”)? [ 1 point ]

The fastq file ERR251551_1.filt.fP has the greatest decline in base quality with increasing sequencing cycle

Q3.2 Two samples (four fastqs) appear to have unusually high GC content and unusually high duplication levels? Which samples are they? [ 1 point ]

The samples having unusually high GC content and unusually high duplication levels are SRR702073 and SRR766045.

Q3.2 Was there any residual adapter contamination in any fastq file after processing with reads with fastp [ 1 point ]?

There doesn’t seem to be residual adapter contamination in any fastq file after processing with fastp based on the MultiQC report. According to the MultiQC report, no samples found with any adapter contamination is greater than 0.1%.