Next Generation Sequence Analysis Homework Week 1

This assignment serves as an introduction to the New York University Greene cluster and a brief introduction to working with BASH. Topics covered in the homework include logging in, file system organization, software modules, executing jobs with the slurm job scheduler, and BASH basics.

Before attempting to complete this assignment, you should watch the pre-recorded lecture on the introduction to the HPC in the Week 1 materials. You may wish to complete this assignment in parallel with watching the video.

If you are unfamiliar with the Unix command line, you will need to review online tutorials as these introductory level skills are essential for this course.

This assignment assumes that you have access to a Unix terminal (on Mac or Linux, or via an emulator such as Cygwin on Windows), have an account on the HPC, and can log in (see Task 1).

If you need assistance, post a question on Slack for Week 1.

Completing your assignment

Since you will regularly submit code from shell scripts or R, the preferred way to submit your assignments is to submit an .html file produced by RMarkdown/Markdown and occasionally other files as requested. Your code should be embedded in readable code chunks.

Where possible please upload the answers to questions as a single markdown .html (preferred) report. Other formats are discouraged but .txt, .pdf, and .docx are accepted.

Please always include your name in the filename AND at the top of the document.

Please upload this week's report to the Homework 1 section at the Assignments link in NYU Brightspace.

Task 1: Log in to HPC, request a compute node, and navigate file system

Open a Unix terminal and log in to Greene by entering the following. If you are working on Windows, see “Getting Started->Verify Your HPC Account” on NYU Brightspace for additional instructions. Note that there are alternative approaches, but the following approach will work whether you are on or off an NYU campus. A note on syntax: throughout the course I will use less-than and greater-than symbols to enclose text that should be replaced. Please replace the enclosed text and remove the less-than and greater-than symbols before executing.

ssh <net id>@gw.hpc.nyu.edu
<password>

You are now logged into the Bastion host. Now enter:

ssh <net id>@greene.hpc.nyu.edu
<nyu password>

You are now logged into a login node on the NYU Greene cluster. Login nodes are shared among all HPC users so it is essential that you switch to a compute node before performing any tasks.

From the login command prompt, you can request a compute node using the following syntax. This will request a node for 4 hrs, with 4 GB of memory, and using BASH as the shell. A prompt will appear upon completion.

srun --time=4:00:00 --mem=4GB --pty /bin/bash

HPC users have a personal directory in each of the /archive, /home, and /scratch file systems, with read, write and execute permissions.

You can use the environment variables HOME, ARCHIVE, and SCRATCH, which are defined in your environment and contain the full paths to the corresponding directories, to locate and navigate to the directories as follows. The command “echo $HOME” asks the shell to “expand” the environment variable called HOME and print its value to your terminal. Execute the following at your command prompt:

pwd
echo $HOME
echo $ARCHIVE
echo $SCRATCH
cd $SCRATCH
pwd
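
To make the idea of expansion concrete, here is a minimal sketch that can be run in any shell, not only on the HPC. It contrasts the literal word HOME with the expanded $HOME. The ${VAR:-fallback} form on the last line is an addition of this example (it is not used elsewhere in the assignment) so that the line still prints something sensible on a machine where SCRATCH is not defined.

```shell
# Without "$", echo receives the literal word HOME
literal=$(echo HOME)

# With "$", the shell expands the variable to its value before echo runs
expanded=$(echo "$HOME")

echo "literal:  $literal"
echo "expanded: $expanded"

# ${VAR:-fallback} expands to the fallback text when VAR is unset or empty,
# as SCRATCH will be on a machine that is not the HPC
echo "scratch:  ${SCRATCH:-SCRATCH is not set on this machine}"
```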

These directories differ in important ways including the disk space allocated to each, whether they are flushed, and hardware-related factors including I/O speed.

Please make sure you have watched the pre-recorded video for more information about these directories and what types of files should be written to them.

Q1.1. Please answer the following questions concerning the directories [ 1 point ].

Q1.1a. Which of these directories should you write output to from jobs submitted on compute nodes? Indicate all that apply from /scratch, /archive, /home

Q1.1b. Which of these directories is backed up and can be recovered should the data be lost? Indicate all that apply from /scratch, /archive, /home

Q1.1c. Which of these directories is flushed every 60 days? Indicate all that apply from /scratch, /archive, /home

Q1.1d. Execute the “myquota” command to determine how much disk space you have available in each directory. Note that if you are working from a compute node prompt, the /archive directory will not appear because /archive is not mounted on compute nodes. How much space do you have remaining in each of your /home and /scratch directories?

Task 2: A brief introduction to working with BASH

This course requires that you can execute basic commands in BASH. This section provides a very brief introduction to a few useful commands, but is not intended to be a comprehensive introduction.

You should be working from a compute node, so please see steps above to request one if you are working from a login node.

Begin by changing directories to your personal directory on /scratch. Then create a directory called “ngs.week1”

cd $SCRATCH
pwd
mkdir ngs.week1
cd ngs.week1
pwd
ls

Piping and redirection of the streams STDIN, STDOUT and STDERR are fundamental to working effectively with the Unix command line and third-party tools developed for NGS analysis.

To illustrate, print “hello world” to STDOUT and redirect STDOUT to a file named helloworld.txt using “>” and then append to the file using “>>”. Note that any time you redirect STDOUT with “>” it will overwrite an existing file of the same name.

echo hello world > helloworld.txt
ls
echo hello world again >> helloworld.txt
less helloworld.txt # Enter q to exit and return to prompt
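
The sketch below repeats the same ideas in a throwaway directory so you can see overwrite versus append without touching your own files. It also shows “2>”, which redirects STDERR separately from STDOUT. The directory and file names (demo.txt, out.log, err.log) are made up for this illustration.

```shell
# Work in a temporary directory so nothing in your own directories is touched
demo_dir=$(mktemp -d)

echo first > "$demo_dir/demo.txt"      # ">" creates (or overwrites) the file
echo second >> "$demo_dir/demo.txt"    # ">>" appends a second line
lines_after_append=$(wc -l < "$demo_dir/demo.txt")

echo third > "$demo_dir/demo.txt"      # ">" again: the earlier two lines are gone
lines_after_overwrite=$(wc -l < "$demo_dir/demo.txt")

# Send STDOUT and STDERR of the same command group to different files
{ echo "to stdout"; echo "to stderr" >&2; } > "$demo_dir/out.log" 2> "$demo_dir/err.log"

echo "lines after append:    $lines_after_append"
echo "lines after overwrite: $lines_after_overwrite"
cat "$demo_dir/out.log" "$demo_dir/err.log"
```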

Many tools can also read STDIN as an input stream. For example, here we pass text printed to STDOUT by the perl command to STDIN of the cat command, which then writes the output to STDOUT. STDOUT is printed to the terminal if it is not redirected elsewhere.

perl -e 'print "hello world\n"' | cat -et
perl -e 'print "hello\tworld\n"' | cat -et

The above illustrates the concept of piping with “|”.

The above commands also illustrate the use of cat -et to distinguish delimiters (note the difference in the outputs of the two commands).
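
If perl is not available on your machine, the same comparison can be sketched with printf, a shell builtin that also interprets \t and \n. This is an alternative to the perl one-liners, not a replacement for them; capturing the results in variables (the names are made up here) simply makes the two outputs easy to compare side by side.

```shell
# Pipe a space-separated and a tab-separated line through cat -et
with_space=$(printf 'hello world\n' | cat -et)
with_tab=$(printf 'hello\tworld\n' | cat -et)

# Print both results; compare how the separator between the words is displayed
echo "$with_space"
echo "$with_tab"
```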

If you are unfamiliar with any command or its options, you can use the “man” command to get information. For example, if you want to learn about the -e and -t options for cat enter:

man cat # enter q to exit the manual view and return to the command prompt, use arrows to scroll through the "man" pages
cat helloworld.txt # this prints the content of the file to STDOUT, which is your console by default

Try using “man” on some of the other commands above and type “q” to exit when you are done.

Now you can try renaming and deleting a file.

mv helloworld.txt helloworld2.txt
ls
rm helloworld2.txt
ls

Frequently in NGS analysis, we have very large text files such as SAM (.sam) that can’t be “opened”. We often want to review the contents and assess, for example, whether the file is space or tab-delimited (or something else).

To do so, we might pipe the STDOUT of cat to “less” or “more” commands. Review the contents of a SAM (.sam) file (a text file with .sam file extension) located in the course directory for Week 1.

cat -et /scratch/work/courses/BI7653/hw1.2021/week1.sam | less

Q2.1 Scroll down past the header lines of the SAM file to where the alignment records begin and answer the following [ 1 point ].

Q2.1a What is the delimiter between columns of an alignment record (row)? (Hint: your answer should not be “^I”. You may need to use online resources to answer the question.)

Q2.1b What does the “$” represent at the end of each line?

Frequently, we would like to combine multiple BASH commands into a shell script. Executing a shell script directly at the command prompt is different from submitting a job on the NYU HPC, as we shall see.

Your instructor has prepared a shell script for you to review and execute. Begin by copying the shell script to your present working directory (indicated by “.” below)

cp /scratch/work/courses/BI7653/hw1.2021/week1.sh .
cat week1.sh | less

The script illustrates the use of a shebang line, the convention of using “#” before comments (not interpreted by BASH), examples of variable definitions, the cut and grep commands, variable expansion, and command substitution (e.g., “$(date)”, where date is a Unix command).

Note that variable names can use either upper or lower case or a mix of each in BASH, but by convention “shell variables” (whose scope is limited to a particular process) are frequently defined with lower case letters to distinguish them from “environment variables”, which are upper case (e.g., SCRATCH) and are passed on to child processes.
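
The difference in scope can be seen in a minimal sketch like the one below. The variable names are made up for the illustration, and “sh -c” is used only as a convenient way to start a child process.

```shell
# A shell variable: known only to the current shell process
shell_var="only in this shell"

# An environment variable: "export" marks it for inheritance by child processes
export DEMO_ENV_VAR="inherited by children"

# sh -c starts a child process; the single quotes stop the parent shell from
# expanding the variables, so the child reports what it can actually see
child_sees_shell_var=$(sh -c 'echo "${shell_var:-unset}"')
child_sees_env_var=$(sh -c 'echo "${DEMO_ENV_VAR:-unset}"')

echo "shell_var in child:    $child_sees_shell_var"
echo "DEMO_ENV_VAR in child: $child_sees_env_var"
```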

Next, we want to execute the shell script. There are multiple ways to do so. The first is to make the script executable then execute as shown here:

chmod +x week1.sh # This makes the script week1.sh executable
./week1.sh  # This executes the commands in week1.sh

Note that executing in this fashion requires, for most languages, an appropriately defined shebang line, which indicates what language the script is written in and what interpreter to use (e.g., python). Although we often include a shebang line in BASH scripts, this is not strictly necessary because the shell will assume the script is BASH if none is defined. A perl or python script, for example, would require an appropriate shebang line to execute using the above approach.

An alternative approach is to execute without making the script executable. You can do this by specifying the interpreter (for example, BASH, python, perl, Rscript etc.) explicitly as shown below. In this instance, a shebang line is not required.

bash week1.sh
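
The two approaches can be compared side by side with a small stand-in script. The script below (demo.sh) is hypothetical and is not the content of week1.sh; it only borrows the same ingredients (shebang, comment, variable definition, cut, command substitution). It assumes bash is installed at /bin/bash.

```shell
demo_dir=$(mktemp -d)

# Write a tiny script; the quoted 'EOF' keeps $(...) and $vars from expanding now
cat > "$demo_dir/demo.sh" <<'EOF'
#!/bin/bash
# Lines starting with "#" are comments and are ignored by BASH
greeting="hello"                              # variable definition
third=$(echo "a,b,c" | cut -d , -f 3)         # command substitution using cut
echo "$greeting $third"                       # variable expansion
EOF

# Method 1: make the script executable and run it; the shebang picks the interpreter
chmod +x "$demo_dir/demo.sh"
out_executable=$("$demo_dir/demo.sh")

# Method 2: name the interpreter explicitly; no execute permission or shebang needed
out_interpreter=$(bash "$demo_dir/demo.sh")

echo "method 1: $out_executable"
echo "method 2: $out_interpreter"
```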

Q2.2 Execute week1.sh using your preferred method and copy both the command and output into your answers file [ 1 point ].

Task 3: Executing a job with sbatch

The Greene cluster uses the slurm job scheduler to allocate resources to a job such as the amount of memory (i.e., RAM), the number of CPUs, and the run time. Users may customize their resource requests at time of execution via directives provided in a job submission script. Following job submission (i.e., using the sbatch command), the scheduler places the job request in a queue and executes the script once sufficient resources are available (frequently within seconds or minutes).

Review the first few lines of a typical script for job submission via the slurm sbatch command. The script is written in BASH, although it will be executed with the slurm sbatch command and has special directives beginning with “#SBATCH”.

less /scratch/work/courses/BI7653/hw1.2021/slurm_template.sh

The “#” symbol at the start of “#SBATCH” instructs the BASH interpreter to ignore the line, but upon submission with sbatch these directives will be interpreted by the slurm scheduler and will customize the configuration of the job request. The options prefixed with “--” and their corresponding values (after the “=”) determine the amount of resources or other information specific to the job.

For this course, you will always use --nodes=1 (because none of the software we will use is engineered for multiple nodes) and --tasks-per-node=1. The --cpus-per-task option allows you to configure a job that runs multi-threaded software. If you wish to run third-party software with 4 threads, you would specify --cpus-per-task=4.
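
As a rough illustration of how such a header fits together, the sketch below writes a hypothetical job script to a temporary directory and then runs it with plain bash to show that the #SBATCH lines are skipped as comments. The file name, job name, and resource values are examples chosen for this sketch and are not the contents of the course template. SLURM_CPUS_PER_TASK is a variable that slurm sets inside a job when --cpus-per-task is given; the fallback to 1 is an assumption added so the script also runs outside the scheduler.

```shell
demo_dir=$(mktemp -d)

# Hypothetical job script; resource values are examples, not the course template
cat > "$demo_dir/demo_job.sh" <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=1:00:00
#SBATCH --mem=4GB

# Inside a slurm job this reports the value given to --cpus-per-task;
# outside the scheduler the variable is unset, so fall back to 1
echo "threads requested: ${SLURM_CPUS_PER_TASK:-1}"
EOF

# Count the directive lines that sbatch would read
directive_count=$(grep -c '^#SBATCH' "$demo_dir/demo_job.sh")
echo "directives: $directive_count"

# Run with plain bash: every #SBATCH line is skipped as a comment
plain_run=$(bash "$demo_dir/demo_job.sh")
echo "$plain_run"
```

On the cluster, the same file would instead be submitted with sbatch, at which point the scheduler reads the six directives.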

You can review all the options for the sbatch job submission command in the manual page:

man sbatch # enter q to exit and return to command prompt

Now you will submit a slurm job script and review the STDOUT and STDERR for the job script. Unlike execution of a shell script, you will use the sbatch command. The STDOUT and STDERR for sbatch are written to output files (see pre-recorded video) and cannot be redirected with “>” and “2>” syntax.

A template for the script, which you may wish to use throughout the course, is provided at /scratch/work/courses/BI7653/hw1.2021/slurm_template.sh. Once you have copied it to your /scratch, you will modify the script to change resource requests and download a BAM alignment and its index file from the 1000 Genomes Project.

Q3.1. Now you will create and modify a shell script with a command line (or “terminal”) text editor and execute it as a slurm job.

Command line text editors nano, vim, emacs are available on the HPC. You may launch a text editor simply by typing the name of the editor at the command prompt. nano is the simplest editor available on HPC and recommended for a quick start.

Perform the following tasks after confirming that you are working on a compute node.

  1. confirm you are working on a compute node and your prompt is located in $SCRATCH/ngs.week1
  2. copy the template shell script /scratch/work/courses/BI7653/hw1.2021/slurm_template.sh to the present working directory
  3. rename slurm_template.sh to a name of your choosing
  4. using a command line editor, modify the #SBATCH directives in the shell script to increase memory to 10GB, decrease the wall time to 4 hrs, and add your email.
  5. modify the shell script to include the following commands below the #SBATCH directives and after the “module purge” command. The wget commands will download two files, a BAM and its index file, from the 1000 genomes project.
echo script begin: $(date)
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/alignment/HG00096.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/alignment/HG00096.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai

echo script completed: $(date)

For Q3.1, report in your homework answers document (1) the commands you used for steps 1-5, (2) the contents of your script, and (3) the job id [ 1 point ].

Q3.2. Submit your job script using the sbatch command and monitor your job status using the command:

squeue -u <user id> 

The job will typically register as pending “PD”, running “R”, completing “CG”, or completed “CD”. If the job is no longer in the queue then it has finished. If you have a syntax error you can typically identify the problem by reviewing the STDERR of the job, or by reviewing the exit status (see pre-recorded video).
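
Slurm's compact state codes in the squeue output include PD (pending), R (running), CG (completing), and CD (completed). The helper below is a hypothetical convenience function, not part of slurm, that translates a code into a description. A fixed list of codes stands in for real squeue output so that the sketch runs on any machine.

```shell
# Translate a compact slurm state code into a short description.
# Only four common codes are covered; anything else is reported as-is.
describe_state() {
    case "$1" in
        PD) echo "pending" ;;
        R)  echo "running" ;;
        CG) echo "completing" ;;
        CD) echo "completed" ;;
        *)  echo "other ($1)" ;;
    esac
}

# On the cluster the code could come from squeue, for example:
#   squeue -u <user id> -h -o "%t"
# Here a fixed list stands in for that output so the sketch runs anywhere.
for code in PD R CD; do
    echo "$code -> $(describe_state "$code")"
done
```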

Q3.2 Now answer the following questions [ 1 point ].

Q3.2a What is the job id of your job?

Q3.2b What are the names of ALL the files in the directory where you launched the job after the job has completed?

Q3.2c What is the exit status of your job? To see it, execute seff <job id>

Q3.2d How much memory (RAM) was used? Again, try seff <job id>

Q3.3 Answer the following [ 1 point ].

Q3.3a What is the name of the file(s) with the STDERR and STDOUT for your job (hint: watch the pre-recorded video)?

Q3.3b What is the output of the “date” command substitution from your script in the STDERR/STDOUT file for your job?

Task 4: Working with software modules

Confirm you are working from a compute node for the following task or request one if you are not.

All third-party software you need for this course has been installed on the HPC in the form of software modules. In order to execute software (e.g., samtools), you must first load the module. Once the module is loaded, the executable (binary) will be in your PATH and can be executed directly.

Review your PATH variable. This variable contains the set of directories that are searched when you execute a command (such as ls, cat, grep) or a third party tool such as samtools.

echo $PATH

The “which” command returns the full path to an executable if it is in a directory in your user PATH variable (note: you never need the full path to execute software for this course, but the “which” command shows you if the software is in your PATH).

Enter the following commands and confirm that the “cat” executable (or “binary”) is in your path but the samtools executable is not.

which cat
which samtools

Since samtools is not in your PATH, the path to it is not returned.
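
The role of PATH can be demonstrated without any modules. The sketch below creates a made-up executable called toytool in a temporary directory, shows that the shell cannot find it, and then prepends the directory to PATH. It uses “command -v”, the shell's built-in counterpart to “which”. Loading a module typically changes PATH in a similar way, among other things, although the details depend on the module.

```shell
# Create a throwaway directory holding a tiny executable called "toytool"
tool_dir=$(mktemp -d)
printf '#!/bin/sh\necho "toy tool ran"\n' > "$tool_dir/toytool"
chmod +x "$tool_dir/toytool"

# Not on PATH yet, so the lookup fails
before=$(command -v toytool || echo "not found")

# Prepend the directory to PATH, then look again
PATH="$tool_dir:$PATH"
after=$(command -v toytool)

echo "before: $before"
echo "after:  $after"

# The tool can now be run by name alone, without its full path
toytool
```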

Now load samtools so that the executable is in your PATH. How can we do that? You can search all available software on the HPC or restrict the search as follows:

module avail              # shows all modules available on the HPC (module belongs to the environment modules system, not slurm)
module avail samtools     # shows all samtools versions
module avail sam          # shows all modules whose names contain "sam"

Typically you can use the most recent version of a software for this course unless instructed otherwise.

Q4.1. Perform the following steps and save commands and output for your answer using the pre-recorded video (and powerpoint) for help.

  1. Load the most recent samtools module (highest version number) (see the pre-recorded video for help with the module load command)
  2. Use the “which” command to confirm samtools is now in your path.
  3. Print the samtools help to your terminal
samtools --help | head -n 5 # or simply enter "samtools | head -n 5"
  4. List all the modules loaded
  5. Clear your environment by purging loaded modules

Report all command lines and output from Q4.1 for your answer [ 1 point ].

Q4.2. Convert the BAM downloaded in Task 3 to SAM format.

Change directory (cd) to the directory where the BAM file you downloaded from the 1000 Genomes Project is located then perform the BAM->SAM conversion:

samtools view -h <bam file> > <sam file>

Note: samtools view writes SAM to STDOUT. By convention, you should name the output SAM file with the same basename (everything before the file extension .bam) as the BAM but with the .sam file extension.
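
One way to follow this naming convention without retyping the file name is shell parameter expansion. In the sketch below the BAM file name is hypothetical, and the samtools line is left as a comment because it only works once the module is loaded and the file exists.

```shell
bam_file="sample.chrom11.bam"        # hypothetical file name
sam_file="${bam_file%.bam}.sam"      # strip the trailing .bam, then add .sam
echo "$sam_file"

# With the samtools module loaded, the conversion would then be:
#   samtools view -h "$bam_file" > "$sam_file"
```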

For your Q4.2 answer, report the first 10 lines of the SAM file (e.g., head -n 10 <sam file>). If you submit via a markdown document (e.g., Rmarkdown), please include such text in a code block for readability [ 1 point ].

Task 5: Storage considerations on the Greene cluster

When working with NGS files on a shared file system, it is important to keep track of your disk usage since some directories have relatively small space allocations per user.

Navigate to the directory where you conducted the BAM->SAM conversion and answer the following. Include both your commands and their output in your answers.

Q5.1. What is the size of the SAM file (in human readable bytes)? See the man page for the “du” command and report the human readable file size. [ 1 point ].

Q5.2 How did your /scratch quota change relative to your myquota command from Task 1? Include the output from your terminal into your answer (you can highlight text on your console and copy and paste to your homework document) [ 1 point ].