Introduction to Linux and command lines for NGS data analysis

0.1 Objectives

In this session we will delve deeper into the shell and Bash scripting. We will:

Learn more about the Linux file system
Create our first bash script

“Know the rules well, so you can break them effectively.” — Dalai Lama XIV

0.2 Introduction

Linux is an open-source operating system widely used in bioinformatics and NGS data analysis due to its flexibility, efficiency, and command line interface. Linux provides a robust environment for running bioinformatics tools and pipelines, allowing researchers to efficiently analyze large-scale NGS datasets.

The terminal, also known as the command line interface (CLI), is a text-based interface that allows users to interact with the Linux system by typing commands. Command line interfaces offer several advantages, such as the ability to automate tasks, work with remote servers, and efficiently manipulate files and data.

Figure: Task automation using command-line scripts

0.3 Recap: Basic Linux commands

So far we have learned how to:

Establish our current working directory, pwd
Create a new directory, mkdir
Determine the contents of a directory, ls

We have also learned how to:

Change directory or move into a different directory, cd
Read the contents of a text file, cat, less, head, tail
Manage Linux permissions, chmod
Get help, man
Delete files, rm
Copy files, cp (renaming is done with mv)
Edit text files, nano
Use grep to find information in files

0.3.1 The Pipe Operator

The pipe operator (|) combines two or more commands: the output of one command becomes the input of the next, whose output may in turn feed another command, and so on. A pipe can be thought of as a temporary connection between two or more commands, programs or processes. In the example below, echo prints a sentence and cut splits it on spaces and keeps the third field.


echo "Here is some text" | cut -d " " -f3

## some
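To illustrate chaining more than two commands, here is a minimal sketch that finds the most frequent word in a short piece of text. The sample text and the variable names text and top_word are made up for this illustration.

```shell
# Hypothetical sample text; any space-separated words would do.
text="reads contigs reads scaffolds reads contigs"

# One word per line -> sort -> count duplicates -> sort by count (highest first)
# -> keep the top line -> print only the word (second column).
top_word=$(echo "$text" | tr ' ' '\n' | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

echo "Most frequent word: $top_word"
```

Each stage does one small job, and the pipe hands its output to the next stage without any intermediate files.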

0.4 The Linux File System

The file system is how our files and directories are organised. At the heart of this system is a series of important files and folders that we need to know in order to effectively organise our work going forward. We will look at these in a bit more detail.

Figure: The Linux File System

The system directories contain files created by the operating system, system software, and user-installed programs, and they should generally not be edited by the user. In rare cases, you may need to edit configuration files such as /etc/host.conf or /etc/hostname.

Another exception is /var/www/html, where we put website files for self-hosted projects or for sites served from a computer with a public IP address.

Other important directories to note are:

0.4.1 Special directories and directory notations

There are some special characters that denote specific directories. Learning these will help you traverse the file system more fluently and understand other people's representations of file paths.

Root directory: The very top of the hierarchy is /, usually referred to as the root directory. All files and folders on your system can be reached from the root.

Home directory: The home directory (/home on most Linux systems, /Users on macOS) contains a system-generated folder for each user on the system. This folder is designed to hold the working files and personal configurations of the user. The user has full rights and permissions over their own home directory. Other users who are not admins normally cannot access a user's home directory. When you open your terminal, you are automatically placed in your home directory.
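Your own home directory can be written either as a tilde (~) or through the HOME environment variable. A minimal sketch, assuming a standard shell session in which HOME is set:

```shell
ls ~          # list the home directory using the tilde shortcut
ls "$HOME"    # list the same directory using the HOME variable
```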

As you can see, both produce the same results

Current directory: A single period (.) refers to the current directory.

ls .

## BASH_Tutorial.Rmd
## BASH_Tutorial.html
## NGS-analysis.Rmd
## NGS-analysis.html
## SRR_Acc_List.txt
## SRR_Acc_List5.txt
## SraRunTable.txt
## Tutorial_Bash.Rproj
## bams
## diagnostech-post-ngs-figure1-new.jpeg
## figs
## helloUser.sh
## rsconnect
## run469
## seqpanther_
## some_files
## some_files2
## sra_download
## toSort.txt

Parent directory: Two periods (..) refer to the directory one level above the current one.

ls ..

## AGMT
## HarvadDurbanWorkshop
## MRC Uganda
## Tutorial_Bash
## UFS
## VEME 2023

0.4.2 Absolute vs Relative Paths

Files and directories can be specified using either a relative path or an absolute path. Absolute paths always start with the root directory and provide the full path to the file or directory. A relative path, on the other hand, specifies the location of the file or directory in relation to the current directory. Relative paths are usually shorter than absolute paths.

To understand the difference between the two, let us look at an example.

First let us see the contents of our current directory

ls

## BASH_Tutorial.Rmd
## BASH_Tutorial.html
## NGS-analysis.Rmd
## NGS-analysis.html
## SRR_Acc_List.txt
## SRR_Acc_List5.txt
## SraRunTable.txt
## Tutorial_Bash.Rproj
## bams
## diagnostech-post-ngs-figure1-new.jpeg
## figs
## helloUser.sh
## rsconnect
## run469
## seqpanther_
## some_files
## some_files2
## sra_download
## toSort.txt

Now enter the pwd command and you should see:

pwd

## /Users/sanem/temp/CoursesandTrainings/Tutorial_Bash

This is the full name, or absolute path, of your current directory. In my case, it tells me that I am in a directory called Tutorial_Bash, which sits inside a directory called CoursesandTrainings, and so on all the way up to the root of the hierarchy.

We can reach the same directory with shorter notations: the tilde (~), which the shell expands to our home directory, or a relative path built with .. that climbs up from the current directory:

ls ~/temp/CoursesandTrainings/Tutorial_Bash

## BASH_Tutorial.Rmd
## BASH_Tutorial.html
## NGS-analysis.Rmd
## NGS-analysis.html
## SRR_Acc_List.txt
## SRR_Acc_List5.txt
## SraRunTable.txt
## Tutorial_Bash.Rproj
## bams
## diagnostech-post-ngs-figure1-new.jpeg
## figs
## helloUser.sh
## rsconnect
## run469
## seqpanther_
## some_files
## some_files2
## sra_download
## toSort.txt

ls ../../../temp/CoursesandTrainings/Tutorial_Bash

## BASH_Tutorial.Rmd
## BASH_Tutorial.html
## NGS-analysis.Rmd
## NGS-analysis.html
## SRR_Acc_List.txt
## SRR_Acc_List5.txt
## SraRunTable.txt
## Tutorial_Bash.Rproj
## bams
## diagnostech-post-ngs-figure1-new.jpeg
## figs
## helloUser.sh
## rsconnect
## run469
## seqpanther_
## some_files
## some_files2
## sra_download
## toSort.txt
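The same idea can be tried safely in a throwaway directory. The sketch below is only an illustration: the directory names project and data and the variables via_relative and via_absolute are hypothetical.

```shell
# Build a small temporary directory tree so no real files are touched.
base=$(mktemp -d)
mkdir -p "$base/project/data"
cd "$base/project" || exit 1

# Relative path: "data" is resolved against the current directory.
via_relative=$(cd data && pwd -P)

# Absolute path: starts at the root, so it works from any directory.
via_absolute=$(cd "$base/project/data" && pwd -P)

echo "$via_relative"
echo "$via_absolute"
```

Both lines print the same location; only the way of naming it differs.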

0.5 Shell and Environment variables - System and User Defined

In Linux, a variable is a name with an associated value. Variables fall into two main categories: shell variables and environment variables. Environment variables are inherited by all child processes and shells spawned from the current shell, while shell variables apply only to the current shell instance. Each shell, such as zsh or bash, has its own set of internal shell variables.

0.5.1 Naming Convention for variables

The names of the variables are case-sensitive. By convention, environment variables should have UPPER CASE names.
When assigning multiple values to the variable they must be separated by the colon : character.
There is no space around the equals = symbol.
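The three rules above can be seen in a short sketch. The variable names sample, SAMPLE, TOOL_DIRS and first_dir and the paths are made up for illustration.

```shell
# Names are case-sensitive: these are two different variables.
sample="lowercase"
SAMPLE="UPPERCASE"
echo "$sample $SAMPLE"

# Several values in one variable are joined with a colon.
TOOL_DIRS="/opt/tools/bin:/opt/other/bin"
first_dir=${TOOL_DIRS%%:*}   # keep everything before the first colon
echo "$first_dir"

# No spaces around '=': writing 'SAMPLE = x' would make the shell
# try to run a command called SAMPLE with the arguments '=' and 'x'.
```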

0.5.2 Shell variables

On top of the shell specific variables, we can also create shell variables for our own use.


PROJECT_DIR="$HOME/project1"   # Create a variable that holds the path to a project1 folder.
echo "$PROJECT_DIR"            # Print the contents of the variable.

## /Users/sanem/project1

0.5.3 Convert user-defined shell variables to environment variables

Our user-defined variables only exist in the context of the current shell. To make them accessible beyond the current shell, i.e. to child processes such as scripts and new shells, we use the export command.


PROJECT_DIR="$HOME/project1"   # Create a variable that holds the path to a project1 folder.
export PROJECT_DIR             # Export the variable by name (no $) so that child processes inherit it.

export -p                     # View all the exported variables in the current shell.
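To see the effect of export, we can ask a child shell what it knows. This is a minimal sketch; the names LOCAL_ONLY, SHARED and child_view are hypothetical.

```shell
LOCAL_ONLY="shell variable"            # not exported: stays in this shell
export SHARED="environment variable"   # exported: copied into child processes

# bash -c starts a child shell. The single quotes stop the parent shell
# from expanding the variables before the child sees them.
child_view=$(bash -c 'echo "LOCAL_ONLY=[$LOCAL_ONLY] SHARED=[$SHARED]"')

echo "$child_view"
```

The child prints an empty value for LOCAL_ONLY and the full value for SHARED.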

0.5.4 Environment variables

In Linux and Unix based systems, environment variables are a set of dynamic named values, stored within the system that are used by applications launched in shells or subshells.

There are several commands available that allow you to list and set environment variables in Linux:

env – The command allows you to run another program in a custom environment without modifying the current one. When used without an argument it will print a list of the current environment variables.


env | head -n3

## NOT_CRAN=true
## SED=/usr/bin/sed
## LN_S=ln -s
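The first use of env, running a program in a custom environment, might look like the sketch below. GREETING is a hypothetical variable chosen for this illustration.

```shell
unset GREETING   # make sure the hypothetical variable is not already set

# Run one command with an extra variable in its environment.
inside=$(env GREETING="hello" bash -c 'echo "$GREETING"')

# The current shell is left untouched, so GREETING is still unset here.
outside=${GREETING:-unset}

echo "inside the command: $inside"
echo "in the current shell: $outside"
```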

printenv – The command prints all or the specified environment variables.

printenv HOME

## /Users/sanem

set – The command sets or unsets shell variables. When used without an argument it will print a list of all variables including environment and shell variables, and shell functions.


set | head -n1

## BASH=/bin/bash

You can also use echo to view the contents of shell and environment variables

echo $USER      # The currently logged-in user.
echo $HOME      # The home directory of the current user.
echo $SHELL     # The path of the current user's shell, such as bash or zsh.

## sanem
## /Users/sanem
## /bin/bash

0.5.5 Persistent Environment Variables

Sometimes we have variables that we want to re-use often, and we do not want to export them each time. We can make these always available by defining them in configuration files. There are several config files, but here are some important ones.

/etc/environment - Use this file to set up system-wide environment variables. Variables in this file are set as plain NAME=value pairs, one per line, without the export command.
/etc/profile - Variables set in this file are loaded whenever a bash login shell is entered. When declaring environment variables in this file you need to use the export command. The export command sets the environment variable.

export PROJECT_DIR=$HOME/project1

Per-user shell specific configuration files. For example, if you are using Bash, you can declare the variables in the ~/.bashrc.


export PROJECT_DIR=$HOME/project1

To load the new environment variables into the current shell session use the source command:


source ~/.bashrc
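The define-then-source cycle can be tried without touching your real configuration. In this sketch a temporary file (rc_file, a hypothetical name) stands in for ~/.bashrc.

```shell
# Temporary stand-in for ~/.bashrc, so the real file is not modified.
rc_file=$(mktemp)

# Append the export line, exactly as we would in ~/.bashrc.
# Single quotes keep $HOME unexpanded until the file is loaded.
echo 'export PROJECT_DIR="$HOME/project1"' >> "$rc_file"

# Load the file into the current shell ("." is the portable spelling of source).
. "$rc_file"

echo "$PROJECT_DIR"
rm -f "$rc_file"
```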

When an environment or shell variable is no longer needed, use unset to delete it.

PROJECT_DIR=$HOME/project1   # Set the project variable.
echo $PROJECT_DIR            # Print it; the path is shown.
unset PROJECT_DIR            # Unset it.
echo $PROJECT_DIR            # Print again; the variable is now empty, so only a blank line is printed.

## /Users/sanem/project1

0.5.6 The PATH Environment Variable

PATH is essentially a colon-separated list of directories. When you execute a command, the shell searches these directories in order and runs the first matching executable it finds.

Let us inspect the current list of directories in our PATH variable. This helps us avoid adding a path that is already there.

echo $PATH

Once we are certain that it is not there, we can add the location to the PATH variable and broadcast the modified variable to the environment.

export PATH=/new/temp/path:$PATH

echo $PATH
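Checking by eye gets tedious when PATH is long. One possible way to automate the check is sketched below; new_dir is a hypothetical variable, and /new/temp/path is the illustrative directory from the example above.

```shell
new_dir="/new/temp/path"

# Wrap PATH in colons so the pattern also matches the first and last entries.
case ":$PATH:" in
    *":$new_dir:"*)
        echo "$new_dir is already in PATH"
        ;;
    *)
        export PATH="$new_dir:$PATH"
        echo "Added $new_dir to PATH"
        ;;
esac
```

Running the snippet a second time reports that the directory is already present instead of adding it again.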

0.6 Batch Processing of Files

Sometimes we want to bulk create, read, update and delete files. Linux allows us to do this in various ways.

0.6.1 Using a for loop

mkdir -p some_files    # Create the directory only if it does not already exist.

cd some_files

# Create 10 files with sample data
for ((i=1; i<=10; i++))
do
    filename="file_$i.txt"
    echo "This is file $i." > "$filename"
    echo "Some sample data for file $i." >> "$filename"
    echo "File $i created."
done

## File 1 created.
## File 2 created.
## File 3 created.
## File 4 created.
## File 5 created.
## File 6 created.
## File 7 created.
## File 8 created.
## File 9 created.
## File 10 created.
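A for loop can also read, update and delete files in bulk by looping over a filename pattern. The sketch below works in a temporary directory so that nothing real is touched; the variable names work_dir, first_line, line_count and remaining are hypothetical.

```shell
work_dir=$(mktemp -d)          # throwaway directory for this illustration
cd "$work_dir" || exit 1

# Create three small files to work on.
for i in 1 2 3
do
    echo "This is file $i." > "file_$i.txt"
done

# Loop over every file that matches the pattern.
for f in file_*.txt
do
    first_line=$(head -n 1 "$f")              # read
    echo "Checked on $(date +%F)." >> "$f"    # update (append a line)
    echo "$f: $first_line"
done

line_count=$(cat file_*.txt | wc -l | tr -d ' ')   # total lines across all files
rm -- file_*.txt                                   # delete
remaining=$(ls | wc -l | tr -d ' ')                # entries left in the directory

echo "Lines before deletion: $line_count, files left: $remaining"
```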

0.6.2 Using GNU Parallel

mkdir -p some_files2    # Create the directory only if it does not already exist.
cd some_files2
parallel 'filename="file_{}.txt" && echo "Some sample data for file {}." >> $filename && echo "File {} created."' ::: $(seq -w 10)

## File 01 created.
## File 02 created.
## File 03 created.
## File 04 created.
## File 05 created.
## File 06 created.
## File 07 created.
## File 08 created.
## File 09 created.
## File 10 created.

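We can now gzip all the files that we have created. In the output below, the "already exists -- skipping" messages appear because the files had already been compressed in an earlier run; gzip does not overwrite an existing .gz file unless the -f option is given.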

cd some_files2

find . -name '*.txt' | parallel gzip --best

ls

## gzip: ./file_09.txt.gz already exists -- skipping
## gzip: ./file_08.txt.gz already exists -- skipping
## gzip: ./file_06.txt.gz already exists -- skipping
## gzip: ./file_07.txt.gz already exists -- skipping
## gzip: ./file_05.txt.gz already exists -- skipping
## gzip: ./file_10.txt.gz already exists -- skipping
## gzip: ./file_04.txt.gz already exists -- skipping
## gzip: ./file_01.txt.gz already exists -- skipping
## gzip: ./file_03.txt.gz already exists -- skipping
## gzip: ./file_02.txt.gz already exists -- skipping
## file_01.txt
## file_01.txt.gz
## file_02.txt
## file_02.txt.gz
## file_03.txt
## file_03.txt.gz
## file_04.txt
## file_04.txt.gz
## file_05.txt
## file_05.txt.gz
## file_06.txt
## file_06.txt.gz
## file_07.txt
## file_07.txt.gz
## file_08.txt
## file_08.txt.gz
## file_09.txt
## file_09.txt.gz
## file_10.txt
## file_10.txt.gz

More practice examples for GNU Parallel can be found here.

0.6.3 Downloading files using curl and wget

Client URL (cURL, pronounced “curl”) and wget are both command-line tools used to retrieve data from the internet through a terminal. They differ in the protocols they support: curl handles a wide variety, including HTTP, HTTPS, FTP, FTPS, SCP, SFTP, and more, while wget primarily supports HTTP, HTTPS and FTP.

Practice: We often need to download software or reference data to our computers. We will use wget and curl to download the RDP taxonomic training data formatted for DADA2.


wget --no-verbose "https://zenodo.org/record/4310151/files/rdp_species_assignment_18.fa.gz?download=1"

To choose the name of the downloaded file, use the -O option. The URL is quoted so that the shell does not treat the ? as a wildcard.


wget --no-verbose -O rdp_species_assignment_18.fa.gz "https://zenodo.org/record/4310151/files/rdp_species_assignment_18.fa.gz?download=1"

We can also use curl to download the file. The -L option tells curl to follow redirects, and --output names the local file.

curl -s -L "https://zenodo.org/record/4310151/files/rdp_species_assignment_18.fa.gz?download=1" --output rdp_species_assignment_18.fa.gz

If you use curl without any options except for the URL, the content of the URL (whether it’s a webpage, or a binary file, such as an image or a zip file) will be printed out to screen.

Use the -X option of curl to specify the HTTP request method (for example GET or POST) that curl should use.

0.7 Bash Scripting

A script is a sequence of commands that together complete a task. The task usually requires more than a one-line command, such as those in the examples we have seen above.

Let us write a first script that greets the logged-in user and tells them which directory they are in.

#!/bin/bash

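luser=$(echo "$USER")   # Command substitution with $(): capture the output of a command.
currdir=`pwd`           # The older backtick form of command substitution.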

echo "Hello ${luser},"
echo ""
echo "You are currently in ${currdir}."
echo ""

The first line of any script should be a line that specifies what interpreter is to be used for this script. This line is commonly known as “hash bang” or “shebang”. The first two characters of this line are #! followed by the path of an interpreter to use. In our examples we will be using the following #!/bin/bash.

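Note the two forms of command substitution used in the script: $(...) and backticks (`pwd`). Both run the enclosed command and substitute its output; $(...) is the preferred modern form.

Now save the file as helloUser.sh and make it executable by typing sudo chmod +x helloUser.sh.

Let us run the script in our command line by typing ./helloUser.sh and then pressing Enter.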

0.8 “Sudo” users or “sudoers”

Notice that we used sudo before chmod to change the file permissions. The sudo command allows a permitted user to run a command as the superuser (root) or as another user, as specified by the security policy on Unix-based systems. All accounts with admin privileges are also sudo users. On Linux servers, a new user can be given sudo privileges by adding them to the file /etc/sudoers or by adding them to the group wheel (named sudo on some distributions such as Ubuntu).

So how do we run our script if we are not sudoers? For a file that we own, chmod +x helloUser.sh works without sudo. Alternatively, we can ask bash to run the script for us, which does not require the file to be executable.


bash helloUser.sh

## Hello sanem,
## 
## You are currently in /Users/sanem/temp/CoursesandTrainings/Tutorial_Bash.
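The two ways of running a script can be compared side by side. This is a sketch under stated assumptions: a temporary file (script, a hypothetical name) stands in for helloUser.sh, and the temporary directory is assumed to allow executing files.

```shell
script=$(mktemp)   # temporary stand-in for helloUser.sh
printf '%s\n' '#!/bin/bash' 'echo "Hello from a script"' > "$script"

# Without the execute permission, we can still hand the file to bash directly.
via_bash=$(bash "$script")

# After chmod +x (no sudo needed for a file we own), the file can be run
# by its path, and the shebang line selects the interpreter.
chmod +x "$script"
via_exec=$("$script")

echo "$via_bash"
echo "$via_exec"
rm -f "$script"
```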

0.9 Organising our analysis project

Exercise: In 5 groups of 5, work together to determine the core components of a project folder.

0.10 Resources for further learning

Codecademy
HackerRank
Babraham Bioinformatics training course. Various courses on R, Python, Linux, Perl and Machine Learning.

Introduction to Linux and command lines for NGS data analysis - Part 1

San Emmanuel James

July 04, 2023