#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=8:00:00
#SBATCH --mem=4GB
#SBATCH --job-name=slurm_template
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=bl2477@nyu.edu
module load fastp/intel/0.20.1
fastp -i /scratch/work/courses/BI7653/hw2.2022/ERR156634_1.filt.fastq.gz -I /scratch/work/courses/BI7653/hw2.2022/ERR156634_2.filt.fastq.gz -o fastpR1.out.fq.gz -O fastpR2.out.fq.gz -l 76 -n 50 --detect_adapter_for_pe
Job Id: 1480736
The exit code is 0, which means that the command ran successfully without encountering any errors.
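As a minimal sketch of how an exit code can be inspected in a shell: the special variable $? holds the exit status of the most recent command. The builtins true and false below are stand-ins for a real command such as fastp and are not part of the original job.

```shell
#!/bin/bash
# `true` always succeeds; it stands in for a command that ran without errors.
true
ok_status=$?                 # 0 means success

# `false` always fails; capture its status on the right-hand side of ||.
false || fail_status=$?      # any non-zero value means failure

echo "exit code after success: $ok_status"
echo "exit code after failure: $fail_status"
```

On a SLURM cluster, sacct -j <jobid> --format=JobID,State,ExitCode reports the exit code of a finished job, assuming job accounting is enabled on the cluster.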
An absolute path specifies a file or directory starting from the root directory (/), so it always begins with / and gives the complete location. A relative path, in comparison, is interpreted relative to the current working directory. For example, suppose I want to access the ngs.week2 directory while I am in the /scratch/bl2477 directory:
Absolute path concept:
cd /scratch/bl2477/ngs.week2
Relative path concept:
cd ngs.week2
With the absolute path, I access the ngs.week2 directory by starting at the root directory (/) and naming every directory along the way. With the relative path, I can directly access ngs.week2 from the directory I am currently in instead of starting from the root directory.
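The two forms can be checked against each other in a throwaway directory. This is only a sketch: the temporary directory created by mktemp -d stands in for /scratch/bl2477, and the variable names are illustrative.

```shell
#!/bin/bash
# Create a scratch area that stands in for /scratch/bl2477 (hypothetical layout).
base=$(mktemp -d)
mkdir "$base/ngs.week2"

# Absolute path: starts at / and works from any current directory.
cd "$base/ngs.week2"
via_absolute=$(pwd -P)

# Relative path: resolved against the current working directory.
cd "$base"
cd ngs.week2
via_relative=$(pwd -P)

echo "absolute: $via_absolute"
echo "relative: $via_relative"

# Clean up the temporary directory.
cd /
rm -rf "$base"
```

Both cd commands end in the same directory; only the starting point used to resolve the path differs.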
To read the input fastq files, I used absolute paths, which start at the root directory (/). To write the processed fastq files, I used relative paths so that the output files are written directly to the current working directory. For example, the fastpR1.out.fq.gz and fastpR2.out.fq.gz output files were written directly to the current directory without naming that directory and without starting from the root directory (/).
The name of the STDOUT file for my job is slurm-14807360.out
#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=8:00:00
#SBATCH --mem=4GB
#SBATCH --job-name=slurm_template
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=bl2477@nyu.edu
#SBATCH -o stdout.txt
module load fastp/intel/0.20.1
fastp -i /scratch/work/courses/BI7653/hw2.2022/ERR156634_1.filt.fastq.gz -I /scratch/work/courses/BI7653/hw2.2022/ERR156634_2.filt.fastq.gz -o fastpR1.out.fq.gz -O fastpR2.out.fq.gz -l 76 -n 50 --detect_adapter_for_pe
# The "#SBATCH -o stdout.txt" line in the job script redirects STDOUT to a file named stdout.txt
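For context, when no -o option is given, sbatch writes output to slurm-%j.out, where %j is replaced by the job ID. The same pattern can be used in an explicit -o option so that repeated submissions do not overwrite one another. The sketch below is a header fragment only; the file name fastp_%j.out and the job_label variable are illustrative choices, not part of the original script.

```shell
#!/bin/bash
#SBATCH --job-name=slurm_template
# %j expands to the job ID, so each submission gets its own STDOUT file.
# The name "fastp_%j.out" is a hypothetical example.
#SBATCH -o fastp_%j.out

# SLURM_JOB_ID is set by SLURM inside a running job; fall back to a label otherwise.
job_label="${SLURM_JOB_ID:-not-in-slurm}"
echo "Running as job: $job_label"
```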
Before processing:
Q30 bases (Read 1): 1654818604(94.3776%)
Q30 bases (Read 2): 1604981035(91.5352%)
After processing:
Q30 bases (Read 1): 1628465308(94.7822%)
Q30 bases (Read 2): 1588178177(92.4373%)
Filtering result:
reads passed filter: 34367822
reads failed due to low quality: 541724
reads failed due to too many N: 0
reads failed due to too short: 158504
reads with adapter trimmed: 201414
bases trimmed due to adapters: 9134238
Duplication rate: 0.768038%
It’s interesting to me that the fastp report includes k-mer counting. Another interesting point is the base content plot, specifically for read 1. Before filtering, the plot of base content ratio against position showed oscillating lines. After filtering, the lines were much smoother.
No adapter sequences were detected
Before filtering, Read 1 and Read 2 each have a total of 58521629 reads.
55124556 reads survived filtering in each of the Read 1 and Read 2 sets.
About 94.1952% of reads survived filtering in the Read 1 set.
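The survival percentage follows directly from the read counts reported above. A quick check with shell arithmetic, using awk only for the floating-point division (the variable names are illustrative):

```shell
#!/bin/bash
# Per-mate read counts as reported above.
reads_before=58521629
reads_after=55124556

# Percentage of Read 1 reads that survived filtering.
pct_survived=$(awk -v a="$reads_after" -v b="$reads_before" \
    'BEGIN { printf "%.4f", a / b * 100 }')
echo "Read 1 surviving: ${pct_survived}%"

# fastp's "reads passed filter" counts both mates, so it is twice the per-mate count.
passed_both=$(( 2 * reads_after ))
echo "reads passed filter (both mates): $passed_both"
```

The doubled count matches the "reads passed filter: 110249112" line in the fastp log for this sample.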
Processing array index: 1 sample: NA18757
Detecting adapter sequence for read1...
No adapter detected for read1
Detecting adapter sequence for read2...
No adapter detected for read2
Read1 before filtering:
total reads: 58521629
total bases: 5910684529
Q20 bases: 5700891751(96.4506%)
Q30 bases: 5376153278(90.9565%)
Read2 before filtering:
total reads: 58521629
total bases: 5910684529
Q20 bases: 5587494577(94.5321%)
Q30 bases: 5236863942(88.6%)
Read1 after filtering:
total reads: 55124556
total bases: 5566101957
Q20 bases: 5450378785(97.9209%)
Read2 aftering filtering:
total reads: 55124556
total bases: 5566101957
Q20 bases: 5410523196(97.2049%)
Q30 bases: 5092424228(91.49%)
Filtering result:
reads passed filter: 110249112
reads failed due to low quality: 6081064
reads failed due to too many N: 0
reads failed due to too short: 713082
reads with adapter trimmed: 949358
bases trimmed due to adapters: 43433322
Duplication rate: 1.01555%
Insert size peak (evaluated by paired-end reads): 171
JSON report: NA18757.fastp.json
HTML report: NA18757.fastp.html
fastp -i /scratch/work/courses/BI7653/hw2.2022/SRR708363_1.filt.fastq.gz -I /scratch/work/courses/BI7653/hw2.2022/SRR708363_2.filt.fastq.gz -o SRR708363_1.filt.fP.fastq.gz -O SRR708363_2.filt.rP.fastq.gz --length_required 76 --detect_adapter_for_pe --n_base_limit 50 --html NA18757.fastp.html --json NA18757.fastp.json
fastp v0.20.1, time used: 1813 seconds
_ESTATUS_ [ fastp for NA18757 ]: 0
Started analysis of SRR708363_1.filt.fP.fastq.gz
Approx 5% complete for SRR708363_1.filt.fP.fastq.gz
Approx 10% complete for SRR708363_1.filt.fP.fastq.gz
Approx 15% complete for SRR708363_1.filt.fP.fastq.gz
Approx 20% complete for SRR708363_1.filt.fP.fastq.gz
Approx 25% complete for SRR708363_1.filt.fP.fastq.gz
Approx 30% complete for SRR708363_1.filt.fP.fastq.gz
Approx 35% complete for SRR708363_1.filt.fP.fastq.gz
Approx 40% complete for SRR708363_1.filt.fP.fastq.gz
Approx 45% complete for SRR708363_1.filt.fP.fastq.gz
Approx 50% complete for SRR708363_1.filt.fP.fastq.gz
Approx 55% complete for SRR708363_1.filt.fP.fastq.gz
Approx 60% complete for SRR708363_1.filt.fP.fastq.gz
Approx 65% complete for SRR708363_1.filt.fP.fastq.gz
Approx 70% complete for SRR708363_1.filt.fP.fastq.gz
Approx 75% complete for SRR708363_1.filt.fP.fastq.gz
Approx 80% complete for SRR708363_1.filt.fP.fastq.gz
Approx 85% complete for SRR708363_2.filt.rP.fastq.gz
Approx 90% complete for SRR708363_2.filt.rP.fastq.gz
Approx 95% complete for SRR708363_2.filt.rP.fastq.gz
Analysis complete for SRR708363_2.filt.rP.fastq.gz
_ESTATUS_ [ fastqc for NA18757 ]: 0
_END_ [ fastp for NA18757 ]: Thu Feb 10 19:17:52 EST 2022
cd /scratch/bl2477/ngs.week2/task2
find "$PWD" -name '*fastqc.zip' > fastqc_files.txt
less fastqc_files.txt
cd /scratch/bl2477/ngs.week2/task3
module load multiqc/1.9
multiqc --file-list /scratch/bl2477/ngs.week2/task2/fastqc_files.txt
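The find command above starts at $PWD rather than at "." because a starting point given as an absolute path makes find print absolute paths, so the list stays valid when MultiQC is run from a different directory (task3 here). A small sketch in a temporary directory, with made-up file names standing in for real FastQC output:

```shell
#!/bin/bash
# Temporary directory with empty placeholder files (hypothetical names).
workdir=$(mktemp -d)
cd "$workdir"
touch sampleA_fastqc.zip sampleB_fastqc.zip sampleA_fastqc.html

# Starting find at "$PWD" yields absolute paths; quoting the pattern stops
# the shell from expanding it before find sees it.
find "$PWD" -name '*fastqc.zip' | sort > fastqc_files.txt

n_files=$(wc -l < fastqc_files.txt | tr -d ' ')
n_absolute=$(grep -c '^/' fastqc_files.txt)
echo "$n_files files listed, $n_absolute with absolute paths"

# Clean up.
cd /
rm -rf "$workdir"
```

Only the .zip files are listed, and every entry begins with /, which is what lets the list be read from any working directory.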
The fastq file ERR251551_1.filt.fP has the greatest decline in base quality with increasing sequencing cycle.
The samples with unusually high GC content and unusually high duplication levels are SRR702073 and SRR766045.
Based on the MultiQC report, there does not appear to be residual adapter contamination in any fastq file after processing with fastp. The report found no samples with adapter contamination greater than 0.1%.