Next Generation Sequence Analysis Homework Week 1

Q1.1. Please answer the following questions concerning the directories [ 1 point ].

Q1.1a. Which of these directories should you write output to from jobs submitted on compute nodes? Indicate all that apply from /scratch, /archive, /home

/scratch

Q1.1b. Which of these directories is backed up and can be recovered should the data be lost? Indicate all that apply from /scratch, /archive, /home

/home, /archive

Q1.1c. Which of these directories is flushed every 60 days? Indicate all that apply from /scratch, /archive, /home

/scratch

Q1.1d Execute the “myquota” command to determine how much disk space you have available in each directory. Note that if you are working from a compute node prompt, the /archive directory will not appear because /archive is not mounted on compute nodes. How much space do you have remaining in each of your /home and /scratch directories?

/home: 50.0GB /scratch: 5.0TB

Task 2: A brief introduction to working with BASH

Q2.1 Scroll down past the header lines of the SAM file to where the alignment records begin and answer the following [ 1 point ].

Q2.1a What is the delimiter between columns of an alignment record (row)? (Hint: your answer should not be “^I”; you may need to use online resources to answer the question.)

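The delimiter between columns of an alignment record is the tab character. SAM is a tab-delimited text format, so each field (column) of an alignment record is separated from the next by a tab. The “^I” shown in the viewer is simply how a tab is displayed when non-printing characters are made visible.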

Q2.1b What does the “$” represent at the end of each line?

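The “$” marks the end of each line, i.e. the position of the newline character. It is not part of the alignment record itself; it is how the viewer displays line endings when non-printing characters are made visible.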
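One way to see both answers at once is to print a record with its non-printing characters made visible. The sketch below assumes GNU coreutils (the -A option of cat) and uses a made-up, shortened alignment record rather than a line from week1.sam; the function name show_hidden is hypothetical.

```shell
#!/bin/bash
# Print stdin with non-printing characters visible (GNU cat -A):
# each tab is shown as ^I and each line end is shown as $.
show_hidden() {
    cat -A
}

# Hypothetical, shortened alignment record (first four SAM columns only).
printf 'read1\t0\t11\t60001\n' | show_hidden
# prints: read1^I0^I11^I60001$
```

Each ^I corresponds to one tab between columns, and the trailing $ corresponds to the newline that ends the record.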

Q2.2 Execute week1.sh using your preferred method and copy both the command and output into your answers file [ 1 point ].

Command:

cp /scratch/work/courses/BI7653/hw1.2022/week1.sh .
less week1.sh
chmod +x week1.sh
./week1.sh

Output is as follows:

This is the contents of the samfile variable: /scratch/courses/BI7653/hw1.2022/week1.sam
This is the first alignment record in the sam file:
grep: /scratch/courses/BI7653/hw1.2022/week1.sam: No such file or directory
This is the chromosome and position of the first 3 records in the sam
grep: /scratch/courses/BI7653/hw1.2022/week1.sam: No such file or directory
The following is todays time and date:
Mon Feb  7 18:08:43 EST 2022
This is todays time and date: Mon Feb  7 18:08:43 EST 2022
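The two grep errors in the output indicate that the path stored in the samfile variable did not exist when the script ran. That path lacks the "work" component present in the directory the script was copied from (/scratch/work/courses/...), which may be the cause. The contents of week1.sh are not reproduced in this document, so the following is only a sketch of how the two lookups might be written with a readability check first; the function names and the miniature SAM file are hypothetical.

```shell
#!/bin/bash
# Print the first alignment record (first line that is not an @ header line).
first_record() {
    if [ ! -r "$1" ]; then
        echo "cannot read SAM file: $1" >&2
        return 1
    fi
    grep -v '^@' "$1" | head -n 1
}

# Print chromosome (column 3) and position (column 4) of the first 3 records.
first_three_positions() {
    if [ ! -r "$1" ]; then
        echo "cannot read SAM file: $1" >&2
        return 1
    fi
    grep -v '^@' "$1" | head -n 3 | cut -f 3,4
}

# Hypothetical miniature SAM file, for demonstration only.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
samfile="$demo_dir/demo.sam"
printf '@HD\tVN:1.0\tSO:coordinate\n' > "$samfile"
printf 'r1\t0\t11\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\n' >> "$samfile"
printf 'r2\t0\t11\t150\t60\t4M\t*\t0\t0\tACGT\tFFFF\n' >> "$samfile"
printf 'r3\t16\t11\t210\t60\t4M\t*\t0\t0\tACGT\tFFFF\n' >> "$samfile"

echo "This is the first alignment record in the sam file:"
first_record "$samfile"
echo "This is the chromosome and position of the first 3 records in the sam"
first_three_positions "$samfile"
```

With the check in place, a wrong path produces one explicit message and a non-zero return status instead of a grep error after each heading.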

Task 3: Executing a job with sbatch

Q3.1. Now you will create and modify a shell script with a command line (or “terminal”) text editor and execute it as a slurm job.

Command line text editors nano, vim, emacs are available on the HPC. You may launch a text editor simply by typing the name of the editor at the command prompt. nano is the simplest editor available on HPC and recommended for a quick start.

Perform the following tasks after confirming that you are working on a compute node.

  1. confirm you are working on a compute node and your prompt is located in $SCRATCH/ngs.week1
  2. copy the template shell script /scratch/work/courses/BI7653/hw1.2021/slurm_template.sh to the present working directory
  3. rename slurm_template.sh to a name of your choosing
  4. using a command line editor, modify the #SBATCH directives in the shell script to increase memory to 10GB, decrease the wall time to 4 hrs, and add your email.
  5. modify the shell script to include the following commands below the #SBATCH directives and after the “module purge” command. The wget commands will download two files, a BAM and its index file, from the 1000 genomes project.
echo script begin: $(date)
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/alignment/HG00096.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/alignment/HG00096.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai

echo script completed: $(date)

For Q3.1, report in your homework answers document (1) the commands you used for steps 1-5, (2) the contents of your script, and (3) the job id [ 1 point ].

Commands used for 1-5:

cd /scratch/bl2477
mkdir ngs.week1
cp /scratch/work/courses/BI7653/hw1.2022/slurm_template.sh ngs.week1
cd ngs.week1
mv slurm_template.sh slurmjob_template.sh
vim slurmjob_template.sh

Content of script:


#!/bin/bash
# 
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=4:00:00
#SBATCH --mem=10GB
#SBATCH --job-name=slurm_template
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=bl2477@nyu.edu


module purge

echo script begin: $(date)
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/alignment/HG00096.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/alignment/HG00096.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai

echo script completed: $(date)

Submit job script command:

sbatch slurmjob_template.sh 

Job ID: 14715512
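The job ID is printed by sbatch in a confirmation line of the form "Submitted batch job <id>", and the default log file is named after it. As a minimal sketch, the ID can be pulled out of that line and reused; the function names are hypothetical, and the message is hard-coded here so the example runs without a scheduler.

```shell
#!/bin/bash
# Extract the job ID (last word) from sbatch's confirmation line,
# which has the form "Submitted batch job <id>".
job_id_from_message() {
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# Default name of the combined STDOUT/STDERR file for a given job ID.
default_log_name() {
    printf 'slurm-%s.out\n' "$1"
}

# On the cluster this string would come from: msg=$(sbatch slurmjob_template.sh)
msg='Submitted batch job 14715512'
job_id=$(job_id_from_message "$msg")
echo "job id:   $job_id"
echo "log file: $(default_log_name "$job_id")"
```

Keeping the ID in a variable makes it easy to pass to later commands such as squeue or seff without retyping it.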

Q3.2. Submit your job script using the sbatch command and monitor your job status using the command:

squeue -u bl2477 

The job will typically register as pending “PD”, running “R”, or completed “CD”. If the job is no longer in the queue then it has completed. If you have a syntax error you can typically identify the problem by reviewing the STDERR of the job, or by reviewing the exit status (see pre-recorded video).

Q3.2 Now answer the following questions [ 1 point ].

Q3.2a What is the job id of your job?

Job ID: 14715512

Q3.2b What are the names of ALL the files in the directory where you launched the job after the job has completed?

Name of the files in the directory:

HG00096.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
HG00096.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
slurm-14715512.out
slurmjob_template.sh
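Because a shell script's exit status is that of its last command (here an echo), a failed wget earlier in the script would not by itself make the job fail. A simple extra check, sketched below, is to confirm that each expected file exists and is non-empty. The function name all_nonempty and the stand-in files are hypothetical.

```shell
#!/bin/bash
# Return 0 only if every named file exists and has a size greater than zero.
all_nonempty() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
    return 0
}

# Demonstration in a temporary directory with stand-in files (hypothetical contents).
check_dir=$(mktemp -d)
trap 'rm -rf "$check_dir"' EXIT
printf 'x' > "$check_dir/example.bam"
printf 'x' > "$check_dir/example.bam.bai"

if all_nonempty "$check_dir/example.bam" "$check_dir/example.bam.bai"; then
    echo "all expected files present"
fi
```

In the job directory, the same function would be called with the BAM and BAI file names listed above.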

Q3.2c What is the exit status of your job? To see it, execute seff <job id>.

The exit status is 0 (state COMPLETED), as reported by seff 14715512:

Job ID: 14715512
Cluster: greene
User/Group: bl2477/bl2477
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:03
CPU Efficiency: 0.45% of 00:11:04 core-walltime
Job Wall-clock time: 00:11:04
Memory Utilized: 1.27 MB
Memory Efficiency: 0.01% of 10.00 GB

Q3.2d How much memory (RAM) was used? Again, try seff <job id>

1.27 MB was used
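Individual fields can also be pulled out of the seff report rather than read by eye. The following is a small sketch that treats the report as "Key: value" lines; the function name seff_field is hypothetical, and the sample text is an excerpt of the seff output reported above so that the example runs without the cluster.

```shell
#!/bin/bash
# Print the value of one "Key: value" field from seff-style text on stdin.
seff_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2 }'
}

# Excerpt of the seff output reported above, used here as sample input.
seff_text='State: COMPLETED (exit code 0)
Job Wall-clock time: 00:11:04
Memory Utilized: 1.27 MB
Memory Efficiency: 0.01% of 10.00 GB'

printf '%s\n' "$seff_text" | seff_field 'Memory Utilized'
# prints: 1.27 MB
printf '%s\n' "$seff_text" | seff_field 'State'
# prints: COMPLETED (exit code 0)
```

On the cluster, the input would instead be piped from seff, for example: seff 14715512 | seff_field 'Memory Utilized'.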

Q3.3 Answer the following [ 1 point ].

Q3.3a What is the name of the file(s) with the STDERR and STDOUT for your job? (hint: watch the pre-recorded video)

slurm-14715512.out

Q3.3b What is the output of the “date” command substitution from your script in the STDERR/STDOUT file for your job?

Output:

script begin: Mon Feb 7 18:44:25 EST 2022
script completed: Mon Feb 7 18:55:22 EST 2022
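As a quick check, the time between the two timestamps above can be computed from their HH:MM:SS parts. This sketch assumes both timestamps fall on the same day (as they do here); the function name hms_to_seconds is hypothetical.

```shell
#!/bin/bash
# Convert an HH:MM:SS string to seconds since midnight.
hms_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

start=$(hms_to_seconds 18:44:25)   # "script begin" time
end=$(hms_to_seconds 18:55:22)     # "script completed" time
elapsed=$((end - start))
echo "elapsed: ${elapsed} s ($((elapsed / 60)) min $((elapsed % 60)) s)"
# prints: elapsed: 657 s (10 min 57 s)
```

The result is a few seconds shorter than the 00:11:04 wall-clock time reported by seff, which would be consistent with a small amount of job startup and teardown outside the two echo commands.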

Task 4: Working with software modules

Q4.1. Perform the following steps and save commands and output for your answer using the pre-recorded video (and powerpoint) for help.

  1. Load the most recent samtools module (highest version number) (see the pre-recorded video for help with the module load command)
  2. Use the “which” command to confirm samtools is now in your path.
  3. Print the samtools help to your terminal
samtools --help | head -n 5 # or simply enter "samtools | head -n 5"
  4. List all the modules loaded
  5. Clear your environment by purging loaded modules

Report all command lines and output from Q4.1 for your answer [ 1 point ].

Command (part 1):

module avail samtools

Output (part 1):

--------------------------- /share/apps/modulefiles ----------------------------
   samtools/intel/1.11    samtools/intel/1.12    samtools/intel/1.14
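With only three versions listed, the highest is easy to spot, but the choice can also be made programmatically. A minimal sketch, assuming GNU sort (for its -V version-sort option) and a hypothetical function name latest_module, with the module names typed in by hand rather than parsed from module avail:

```shell
#!/bin/bash
# Print the entry with the highest version from a list of module names on stdin.
# Relies on GNU sort's -V (version sort) option.
latest_module() {
    sort -V | tail -n 1
}

printf '%s\n' samtools/intel/1.11 samtools/intel/1.12 samtools/intel/1.14 | latest_module
# prints: samtools/intel/1.14
```

Version sort matters because a plain alphabetical sort would rank a version such as 1.9 above 1.14.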

Command (part 1&2):

module load samtools/intel/1.14
which samtools

Output (part 1&2):

/share/apps/samtools/1.14/intel/bin/samtools

Command (part 3):

samtools --help | head -n 5 

Output (part 3):

Program: samtools (Tools for alignments in the SAM format)
Version: 1.14 (using htslib 1.14)

Command (part 4):

module list

Output (part 4):

Currently Loaded Modules:
  1) perl/intel/5.32.0   3) htslib/intel/1.14
  2) intel/19.1.2        4) samtools/intel/1.14

Command (part 5):

module purge
module list

Output (part 5):

No modules loaded

Q4.2. Convert the BAM downloaded in Task 3 to SAM format.

Command:

module load samtools/intel/1.14
samtools view -h HG00096.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam > HG00096.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.sam
head -n 10 HG00096.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.sam

First 10 lines of sam file:

@HD VN:1.0  SO:coordinate
@SQ SN:1    LN:249250621    M5:1b22b98cdeb4a9304cb5d48026a85128 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
@SQ SN:2    LN:243199373    M5:a0d9851da00400dec1098a9255ac712e UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
@SQ SN:3    LN:198022430    M5:fdfd811849cc2fadebc929bb925902e5 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
@SQ SN:4    LN:191154276    M5:23dccd106897542ad87d2765d28a19a1 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
@SQ SN:5    LN:180915260    M5:0740173db9ffd264d728f32784845cd7 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
@SQ SN:6    LN:171115067    M5:1d3a93a248d92a729ee764823acbbc6b UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
@SQ SN:7    LN:159138663    M5:618366e953d6aaad97dbe4777c29375e UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
@SQ SN:8    LN:146364022    M5:96f514a9929e410c6651697bded59aec UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
@SQ SN:9    LN:141213431    M5:3e273117f15e0a400f01055d9f393768 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
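Each @SQ header line describes one reference sequence, with its name in the SN tag and its length in the LN tag. As an illustration, the sketch below extracts those two tags from tab-delimited header text; the function name sq_lengths is hypothetical, and the input is shortened from the header lines above (M5, UR, AS and SP tags omitted).

```shell
#!/bin/bash
# From SAM header text on stdin, print "name<TAB>length" for each @SQ line,
# using the values of the SN: and LN: tags.
sq_lengths() {
    awk -F'\t' '
        /^@SQ/ {
            name = ""; len = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) name = substr($i, 4)
                if ($i ~ /^LN:/) len = substr($i, 4)
            }
            print name "\t" len
        }'
}

# Header lines shortened from the output above.
printf '@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:2\tLN:243199373\n' | sq_lengths
```

On the cluster, the header alone could be supplied with samtools view -H applied to the BAM file and piped into the same function.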

Task 5: Storage considerations on the Greene cluster

Q5.1. What is the size of the SAM file (in human readable bytes)? See the man page for the “du” command and report the human readable file size. [ 1 point ].

cd /scratch/bl2477/ngs.week1
du -h HG00096.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.sam

The human readable file size for the SAM file is 2.7G
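By default du reports the disk blocks allocated to a file, which can differ slightly from the file's length in bytes. The sketch below shows the byte-length variant on a stand-in file of known size; it assumes GNU du (for --apparent-size), and the function name human_size and the 1 MiB demo file are hypothetical.

```shell
#!/bin/bash
# Report a file's size in human-readable units.
# --apparent-size (GNU du) reports the byte length rather than allocated disk blocks.
human_size() {
    du -h --apparent-size "$1" | cut -f 1
}

# Stand-in file of exactly 1 MiB (hypothetical; not the SAM file itself).
size_dir=$(mktemp -d)
trap 'rm -rf "$size_dir"' EXIT
head -c 1048576 /dev/zero > "$size_dir/demo.bin"

echo "bytes: $(wc -c < "$size_dir/demo.bin" | tr -d ' ')"
echo "human: $(human_size "$size_dir/demo.bin")"
```

For quota purposes the default block-based figure from du -h is usually the more relevant one, since that is the space actually consumed on disk.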

Q5.2 How did your /scratch quota change relative to your myquota command from Task 1? Include the output from your terminal into your answer (you can highlight text on your console and copy and paste to your homework document) [ 1 point ].

Disk usage on /scratch increased relative to the myquota output from Task 1. After downloading the BAM and converting it to SAM, 3.30GB of the 5.0TB allocation (0.06%) is in use, across 7 files.

Output:

Filesystem   Environment   Backed up?   Allocation       Current Usage
Space        Variable      /Flushed?    Space / Files    Space(%) / Files(%)

/home        $HOME         Yes/No       50.0GB/30.7K       0.00GB(0.00%)/15(0.05%)
/scratch     $SCRATCH      No/Yes        5.0TB/1.0M       3.30GB(0.06%)/7(0.00%)
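The percentage in the /scratch row can be reproduced from the two space figures. This is a rough sketch that assumes 1 TB = 1024 GB; the unit convention myquota actually uses is not stated in the output, and the function name usage_percent is hypothetical.

```shell
#!/bin/bash
# Percentage of an allocation in use, given usage in GB and allocation in TB.
# Assumption: 1 TB = 1024 GB; the convention used by myquota is not stated.
usage_percent() {
    awk -v used="$1" -v alloc="$2" 'BEGIN { printf "%.2f%%\n", used / (alloc * 1024) * 100 }'
}

usage_percent 3.30 5.0
# prints: 0.06%
```

At this rate of usage the space limit is far from being reached, although the files column (7 of 1.0M) is tracked as a separate limit.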