In this session we will delve deeper into the shell and BASH scripting. We will…;
“Know the rules well, so you can break them effectively.” — Dalai Lama XIV
Linux is an open-source operating system widely used in bioinformatics and NGS data analysis due to its flexibility, efficiency, and command line interface. Linux provides a robust environment for running bioinformatics tools and pipelines, allowing researchers to efficiently analyze large-scale NGS datasets.
The terminal, also known as the command line interface (CLI), is an interface that allows users to interact with the Linux system through typed commands. Command line interfaces offer several advantages, such as the ability to automate tasks, work with remote servers, and efficiently manipulate files and data.
So far we have learned how to use:
pwd
mkdir
ls
We have also learned how to use:
cd
cat, less, head, tail
chmod
man
rm
cp
nano
grep to find information in files
The pipe (|) is used to combine two or more commands: the output of one command acts as input to the next command, whose output may in turn act as input to the following one, and so on. It can also be visualized as a temporary connection between two or more commands, programs or processes.
echo "Here is some text" | cut -d " " -f3
## some
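Pipes can chain more than two commands. The following is a minimal sketch, using a made-up sentence as input, that counts how often each word occurs and reports the most frequent one:

```shell
# Count how often each word occurs and keep the most frequent one.
# tr puts one word per line, sort groups identical words together,
# uniq -c counts each group, and sort -rn orders the counts from high to low.
text="the cat and the dog and the bird"
top_word=$(echo "$text" | tr ' ' '\n' | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')
echo "$top_word"
```

Each command in the chain only sees the output of the command before it, so the stages can be built up and checked one at a time.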
The file system is how our files and directories are organised. At the heart of this system is a series of important files and folders that we need to know in order to effectively organise our work going forward. We will look at these in a bit more detail.
The system directories contain files created by the operating system, system software, and user-installed programs, which should not be edited by the user. In rare cases, you may need to edit configuration files, for example /etc/host.conf or /etc/hostname. One other exception is /var/www/html, where we put the website files for self-hosted projects or for sites served from a computer with a public IP address.
Other important directories to note are:
There are some special characters that denote specific directories. Learning these will help you to traverse the file system with a lot more fluidity and to understand other people's representations of file paths.
Root directory: The very top of the hierarchy is the
/ which is usually referred to as the root directory. All
files and folders present in your system can be found via the root.
Home directory: The home directory contains a system-generated folder for each user on the system. This folder is designed to hold the working files and personal configurations of the user. The user has full rights and permissions over their own home directory. Other users who are not admins typically cannot modify a user's home directory. When you open your terminal, you are automatically placed in your home directory.
or
As you can see, both produce the same results
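As a minimal sketch of that equivalence, assuming the two forms meant here are the tilde (~) and the HOME variable, both expand to the same path:

```shell
# The tilde and the HOME variable both expand to the current user's home directory.
echo ~
echo "$HOME"
```

Either form can be used wherever a path is expected, for example ls ~ or ls "$HOME".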
Current directory: A single period (.) refers to the current directory.
ls .
## BASH_Tutorial.Rmd
## BASH_Tutorial.html
## NGS-analysis.Rmd
## NGS-analysis.html
## SRR_Acc_List.txt
## SRR_Acc_List5.txt
## SraRunTable.txt
## Tutorial_Bash.Rproj
## bams
## diagnostech-post-ngs-figure1-new.jpeg
## figs
## helloUser.sh
## rsconnect
## run469
## seqpanther_
## some_files
## some_files2
## sra_download
## toSort.txt
Parent directory: Two periods (..) refer to the directory one level above the current directory.
ls ..
## AGMT
## HarvadDurbanWorkshop
## MRC Uganda
## Tutorial_Bash
## UFS
## VEME 2023
Files and directories can be specified using either a relative path or an absolute path. Absolute paths always start with the root directory and provide the full path to the file or directory. A relative path, on the other hand, specifies the location of the file or directory in relation to the current directory. Relative paths are usually shorter than absolute paths.
To understand the difference between the two, let us look at an example.
First let us see the contents of our current directory
ls
## BASH_Tutorial.Rmd
## BASH_Tutorial.html
## NGS-analysis.Rmd
## NGS-analysis.html
## SRR_Acc_List.txt
## SRR_Acc_List5.txt
## SraRunTable.txt
## Tutorial_Bash.Rproj
## bams
## diagnostech-post-ngs-figure1-new.jpeg
## figs
## helloUser.sh
## rsconnect
## run469
## seqpanther_
## some_files
## some_files2
## sra_download
## toSort.txt
Now enter the pwd command and you should see:
pwd
## /Users/sanem/temp/CoursesandTrainings/Tutorial_Bash
which is the full name or path to your current directory. In my case, it tells me that I am in a directory called Tutorial_Bash, which sits inside a directory called CoursesandTrainings, and so on all the way up to the root at the very top of the hierarchy.
We can also reach the same directory with a path that starts from the home directory (~) or with a relative path that starts from the current directory (..):
ls ~/temp/CoursesandTrainings/Tutorial_Bash
## BASH_Tutorial.Rmd
## BASH_Tutorial.html
## NGS-analysis.Rmd
## NGS-analysis.html
## SRR_Acc_List.txt
## SRR_Acc_List5.txt
## SraRunTable.txt
## Tutorial_Bash.Rproj
## bams
## diagnostech-post-ngs-figure1-new.jpeg
## figs
## helloUser.sh
## rsconnect
## run469
## seqpanther_
## some_files
## some_files2
## sra_download
## toSort.txt
ls ../../../temp/CoursesandTrainings/Tutorial_Bash
## BASH_Tutorial.Rmd
## BASH_Tutorial.html
## NGS-analysis.Rmd
## NGS-analysis.html
## SRR_Acc_List.txt
## SRR_Acc_List5.txt
## SraRunTable.txt
## Tutorial_Bash.Rproj
## bams
## diagnostech-post-ngs-figure1-new.jpeg
## figs
## helloUser.sh
## rsconnect
## run469
## seqpanther_
## some_files
## some_files2
## sra_download
## toSort.txt
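The same comparison can be tried safely in a throwaway directory. This is a minimal sketch; the directory and file names (project, data, scripts, reads.txt) are placeholders and not part of the tutorial folder:

```shell
# Build a small throwaway tree: <tmp>/project/data/reads.txt
base=$(mktemp -d)
mkdir -p "$base/project/data" "$base/project/scripts"
touch "$base/project/data/reads.txt"

# Work from inside the scripts directory.
cd "$base/project/scripts"

# Absolute path: starts at the root and is valid from any directory.
abs_listing=$(ls "$base/project/data")

# Relative path: starts at the current directory; .. steps up one level.
rel_listing=$(ls ../data)

echo "$abs_listing"
echo "$rel_listing"
```

Both listings show the same file, but the relative path only works while the current directory is scripts, whereas the absolute path works from anywhere.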
In Linux, a variable is a name with an associated value. Variables can be classified into two main categories, namely shell variables and environment variables. Environment variables are available system-wide and are inherited by all spawned child processes and shells, while shell variables apply only to the current shell instance. Each shell, such as zsh or bash, has its own set of internal shell variables.
The names of the variables are case-sensitive. By convention, environment variables should have UPPER CASE names.
When assigning multiple values to a variable, they must be separated by the colon (:) character.
There must be no space around the equals (=) symbol.
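These rules can be illustrated with a short sketch. The variable names and directory values below are hypothetical and chosen only for the example:

```shell
# No spaces around =; names are case-sensitive, so these are two different variables.
sample="SRR001"
SAMPLE="SRR002"
echo "$sample $SAMPLE"

# Several values in one variable, separated by colons (placeholder directories).
TOOL_DIRS="/opt/tools/bin:/opt/scripts"

# Read the values back: strip everything after, or before, the colon.
first=${TOOL_DIRS%%:*}
second=${TOOL_DIRS#*:}
echo "$first"
echo "$second"
```

Writing sample = "SRR001" with spaces would make the shell try to run a command called sample, which is why the spaces must be left out.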
On top of the shell specific variables, we can also create shell variables for our own use.
PROJECT_DIR=$HOME/project1 #create a variable that holds the path to a project1 folder.
echo $PROJECT_DIR #Print the contents of the variable.
## /Users/sanem/project1
Our user defined variables only exist in the context of the current
shell. To make them accessible beyond the scope of the current shell
i.e. to subshells, we use the export keyword.
PROJECT_DIR=$HOME/project1 #create a variable that holds the path to a project1 folder.
export PROJECT_DIR #Export the variable by name (no $) so that child processes can see it.
export -p # View all the exported variables in the current shell.
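The effect of export can be checked by asking a child shell whether it sees the variable. This is a minimal sketch that uses sh -c to start the child process and prints the word unset when the variable is missing:

```shell
# Start from a clean state in case the variable is already defined.
unset PROJECT_DIR

# A plain shell variable is not passed on to child processes.
PROJECT_DIR="$HOME/project1"
child_before=$(sh -c 'echo "${PROJECT_DIR:-unset}"')
echo "before export: $child_before"

# After export, the child process inherits the variable.
export PROJECT_DIR
child_after=$(sh -c 'echo "${PROJECT_DIR:-unset}"')
echo "after export: $child_after"
```

Note that export takes the variable name without the $ sign, because it needs the name and not the value.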
In Linux and Unix based systems, environment variables are a set of dynamic named values, stored within the system that are used by applications launched in shells or subshells.
There are several commands available that allow you to list and set environment variables in Linux:
env – The command allows you to run another program in a
custom environment without modifying the current one. When used without
an argument it will print a list of the current environment
variables.
env | head -n3
## NOT_CRAN=true
## SED=/usr/bin/sed
## LN_S=ln -s
printenv – The command prints all or the specified
environment variables.
printenv HOME
## /Users/sanem
set – The command sets or unsets shell variables. When
used without an argument it will print a list of all variables including
environment and shell variables, and shell functions.
set | head -n1
## BASH=/bin/bash
You can also use echo to view the contents of shell and environment variables
echo $USER #– This points to the currently logged-in user.
echo $HOME #– This shows the home directory of the current user.
echo $SHELL #– This stores the path of the current user's shell, such as bash or zsh.
## sanem
## /Users/sanem
## /bin/bash
Sometimes we have variables we want to re-use often and therefore do not want to have to export them each time. We can make these always available to us by defining them in configuration files. There are several config files but here are some important ones.
/etc/environment - Use this file to set up system-wide environment variables. Variables in this file are set as NAME=value pairs, one per line, without the export keyword.
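A minimal illustration of the format, with a placeholder name and value. On typical Linux systems this file is read as plain NAME=value pairs and not as a shell script, so it is safest to write full paths here and avoid references such as $HOME:

```shell
# One NAME=value pair per line, without the export keyword.
# PROJECT_DIR and its value are placeholders.
PROJECT_DIR="/data/project1"
```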
/etc/profile - Variables set in this file are loaded
whenever a bash login shell is entered. When declaring environment
variables in this file you need to use the export command. The
export command sets the environment variable.
export PROJECT_DIR=$HOME/project1
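~/.bashrc - A per-user shell configuration file. Variables declared here are loaded each time that user starts an interactive bash shell.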
export PROJECT_DIR=$HOME/project1
To load the new environment variables into the current shell session use the source command:
source ~/.bashrc
When an environment or shell variable is no longer needed, use unset to delete it.
PROJECT_DIR=$HOME/project1 ## we set the project variable
echo $PROJECT_DIR ## print it out and surely get a result
unset PROJECT_DIR ## unset it
echo $PROJECT_DIR ## and try to print again. This time nothing is printed.
## /Users/sanem/project1
PATH is essentially a : -separated list of directories. When you execute a command, the shell searches through each of these directories, one by one, until it finds a directory where the executable exists.
Let us inspect the current list of directories in our PATH variable. This helps us avoid adding a path that is already there.
echo $PATH
Once we are certain that it is not there, we can add the location to the PATH variable and broadcast the modified variable to the environment.
export PATH=/new/temp/path:$PATH
echo $PATH
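The lookup can be seen in action with a throwaway directory. The following is a minimal sketch; the command name hello_tool is hypothetical, and the case statement is one possible way to skip directories that PATH already contains:

```shell
# Create a throwaway directory holding one small executable script.
tooldir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from hello_tool"\n' > "$tooldir/hello_tool"
chmod +x "$tooldir/hello_tool"

# Add the directory only if it is not already listed in PATH.
case ":$PATH:" in
    *":$tooldir:"*) ;;                    # already present, nothing to do
    *) export PATH="$tooldir:$PATH" ;;    # prepend so it is searched first
esac

# command -v reports which file the shell will run for a command name.
found=$(command -v hello_tool)
echo "$found"
hello_tool
```

Because the new directory is placed at the front of PATH, it is searched before the system directories. Appending it instead (PATH="$PATH:$tooldir") gives the existing directories priority.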
Sometimes we want to bulk create, read, update and delete files. Linux allows us to do this in various ways.
[ ! -d "some_files" ] && mkdir some_files # create the directory only if it does not exist yet
cd some_files
# Create 10 files with sample data
for ((i=1; i<=10; i++))
do
    filename="file_$i.txt"
    echo "This is file $i." > "$filename"              # > creates or overwrites the file
    echo "Some sample data for file $i." >> "$filename" # >> appends to the file
    echo "File $i created."
done
## File 1 created.
## File 2 created.
## File 3 created.
## File 4 created.
## File 5 created.
## File 6 created.
## File 7 created.
## File 8 created.
## File 9 created.
## File 10 created.
# The same task with GNU parallel, which runs the jobs concurrently
[ ! -d "some_files2" ] && mkdir some_files2 # create the directory only if it does not exist yet
cd some_files2
parallel 'filename="file_{}.txt" && echo "Some sample data for file {}." >> "$filename" && echo "File {} created."' ::: $(seq -w 10)
## File 01 created.
## File 02 created.
## File 03 created.
## File 04 created.
## File 05 created.
## File 06 created.
## File 07 created.
## File 08 created.
## File 09 created.
## File 10 created.
We can now gzip all the files that we have created.
cd some_files2
find . -name '*.txt' | parallel gzip --best
ls
## gzip: ./file_09.txt.gz already exists -- skipping
## gzip: ./file_08.txt.gz already exists -- skipping
## gzip: ./file_06.txt.gz already exists -- skipping
## gzip: ./file_07.txt.gz already exists -- skipping
## gzip: ./file_05.txt.gz already exists -- skipping
## gzip: ./file_10.txt.gz already exists -- skipping
## gzip: ./file_04.txt.gz already exists -- skipping
## gzip: ./file_01.txt.gz already exists -- skipping
## gzip: ./file_03.txt.gz already exists -- skipping
## gzip: ./file_02.txt.gz already exists -- skipping
## file_01.txt
## file_01.txt.gz
## file_02.txt
## file_02.txt.gz
## file_03.txt
## file_03.txt.gz
## file_04.txt
## file_04.txt.gz
## file_05.txt
## file_05.txt.gz
## file_06.txt
## file_06.txt.gz
## file_07.txt
## file_07.txt.gz
## file_08.txt
## file_08.txt.gz
## file_09.txt
## file_09.txt.gz
## file_10.txt
## file_10.txt.gz
More practice examples for GNU parallel can be found here.
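Reading, updating and deleting files in bulk can also be done with a plain for loop and a glob pattern, without GNU parallel. This is a minimal sketch that works in a throwaway directory; the file names are placeholders:

```shell
# Work in a throwaway directory so that no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create: three small files.
for i in 1 2 3; do
    echo "This is file $i." > "file_$i.txt"
done

# Read: print the first line of every .txt file.
for f in file_*.txt; do
    head -n 1 "$f"
done

# Update: append a line to every file, then count all lines.
for f in file_*.txt; do
    echo "Updated." >> "$f"
done
line_count=$(cat file_*.txt | wc -l | tr -d ' ')

# Delete: remove the files and count what is left.
rm file_*.txt
remaining=$(ls | wc -l | tr -d ' ')
echo "$line_count lines before deletion, $remaining files after"
```

The glob file_*.txt is expanded by the shell into the list of matching file names, so the same pattern drives the read, update and delete steps.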
Client URL (cURL, pronounced “curl”) and wget are both command-line tools used to retrieve data from the internet through a terminal. They differ in the protocols they support: curl supports a wide variety, including HTTP, HTTPS, FTP, FTPS, SCP, SFTP, and more, while wget primarily supports HTTP, HTTPS and FTP.
Practice: Often we need to download software or data from the command line, for example to install or update the Java runtime environment on our computers. Here we will use wget and curl to download the RDP taxonomic training data formatted for DADA2.
wget --no-verbose "https://zenodo.org/record/4310151/files/rdp_species_assignment_18.fa.gz?download=1"
To change the name of the downloaded file, use the -O option. Quoting the URL stops the shell from treating the ? as a wildcard character.
wget --no-verbose -O rdp_species_assignment_18.fa.gz "https://zenodo.org/record/4310151/files/rdp_species_assignment_18.fa.gz?download=1"
We can also use curl to download the file.
curl -s "https://zenodo.org/record/4310151/files/rdp_species_assignment_18.fa.gz?download=1" --output rdp_species_assignment_18.fa.gz
If you use curl without any options except for the URL,
the content of the URL (whether it’s a webpage, or a binary file, such
as an image or a zip file) will be printed out to screen.
Use the -X option of curl to specify the HTTP request method, such as GET or POST, that curl should use.
A script is a sequence of commands that together complete a task. The task usually requires more than a one-line command, such as in the examples we have seen above.
Let us write a first script that greets the logged-in user and tells them which directory they are in.
#!/bin/bash
luser=$(echo $USER)
currdir=`pwd`
echo "Hello ${luser},"
echo ""
echo "You are currently in ${currdir}."
echo ""
The first line of any script should be a line that specifies what interpreter is to be used for this script. This line is commonly known as “hash bang” or “shebang”. The first two characters of this line are #! followed by the path of an interpreter to use. In our examples we will be using the following #!/bin/bash.
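Note the $() and `pwd` constructs. Both perform command substitution: the shell runs the enclosed command and replaces it with the command's output. The $() form is generally preferred because it is easier to read and to nest.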
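Both $() and backticks perform command substitution, which captures the output of a command so it can be stored in a variable. A minimal sketch, in which the path /data/project1/reads.txt is a placeholder and does not need to exist:

```shell
# Both forms run the command and substitute its output.
new_style=$(echo "captured")
old_style=`echo "captured"`
echo "$new_style $old_style"

# $() nests cleanly: the inner command runs first, the outer one uses its result.
parent_name=$(basename "$(dirname "/data/project1/reads.txt")")
echo "$parent_name"
```

Nesting backticks requires escaping the inner pair, which is one reason the $() form tends to be easier to maintain.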
Now save the file as helloUser.sh and make it executable by typing,
sudo chmod +x helloUser.sh.
Let us run the script in our command line by typing
./helloUser.sh and then pressing enter.
Notice that we used sudo before chmod to change the file permissions. The sudo command allows a permitted user to run a command as the superuser (root user) or another user, as specified by the security policy on Unix-based systems. All accounts with admin privileges are also sudo users. On Linux servers, a new user can be assigned sudo privileges by adding them to the file /etc/sudoers or by adding them to a group such as “wheel”. Note that sudo is only required when you do not own the file; for a script you created yourself, chmod +x helloUser.sh is sufficient.
So how then do we run our script if we are not sudoers? We can ask bash to run it for us.
bash helloUser.sh
## Hello sanem,
##
## You are currently in /Users/sanem/temp/CoursesandTrainings/Tutorial_Bash.
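The two ways of running a script can be compared side by side. This is a minimal sketch that writes a small script into a throwaway directory; the file name greet.sh is a placeholder, and /bin/bash is assumed to exist on the system:

```shell
# Write a small script into a throwaway directory.
scriptdir=$(mktemp -d)
cd "$scriptdir"
cat > greet.sh <<'EOF'
#!/bin/bash
echo "Hello ${USER:-user}, you are in $(pwd)."
EOF

# You own the file, so chmod works without sudo.
chmod u+x greet.sh

# Run it directly: this needs the execute bit and the shebang line.
direct=$(./greet.sh)

# Or hand it to the interpreter: this needs neither.
via_bash=$(bash greet.sh)

echo "$direct"
echo "$via_bash"
```

Running the script with bash greet.sh works even when the execute permission is missing, because bash only needs to read the file.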
Exercise: In 5 groups of 5, work together to determine the core components of a project folder.