This post shows some basic Linux commands and how they can be used in Bioinformatics or for exploring files. The content is based on two courses I took: “Command Line Tools for Genomic Data Science” (Coursera) and “Good Practices in IT for Bioinformatics” (Unicamp).

Basic Commands

The command ‘echo’ prints a message on the screen.

echo Hello
## Hello

We can print the current date by using ‘date’.

date
## qua 17 fev 2021 13:34:58 -03

Directories

In the system, content is organized in files. The files are stored in directories (folders), which can be located inside other directories. Each level in the path is marked by a slash ‘/’.

‘pwd’ - print working directory

We can print the name of our current location with ‘pwd’ (print working directory).

pwd
## /home/natalia/Documentos/Example

Here, we are in the directory ‘home’, sub-directory ‘natalia’ (the user name), then sub-directory ‘Documentos’ and finally the current sub-directory ‘Example’.

‘cd’ - change directory

We can navigate from this location to any other folder in the file system by using ‘cd’, which means ‘change directory’. ‘.’ refers to the current directory, so ‘cd .’ stays where we are, and ‘cd ..’ moves to the parent directory.

cd /home/natalia/
cd .
pwd
cd ..
pwd
cd /home/natalia/Documentos/Example
## /home/natalia
## /home
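
As a small complement to the navigation commands above, here is a sketch that combines absolute paths, relative paths and ‘cd -’ (return to the previous directory). The directory names ‘project’ and ‘data’ are made up for illustration, and everything happens inside a temporary directory.

```shell
# Work in a throwaway directory so nothing real is touched.
base=$(mktemp -d)
mkdir -p "$base/project/data"

cd "$base/project/data"    # absolute path
cd ..                      # up one level: now in 'project'
here=$(basename "$PWD")
echo "$here"               # project

cd data                    # relative path: back into 'data'
cd - > /dev/null           # jump back to the previous directory
back=$(basename "$PWD")
echo "$back"               # project
```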

Showing Directory Content

‘ls’ - list files

We can list the files in the current directory using ‘ls’ (list files).

ls
## bear.fasta
## bear.fastq
## bear.sam
## butterfly.fasta
## butterfly.fastq
## butterfly.sam
## CommandLineToolsBioinfo.html
## CommandLineToolsBioinfo.Rmd
## dog.fasta
## dog.fastq
## dog.sam
## fish.fasta
## fish.fastq
## fish.sam
## numbers
## numbers1
## numbers1sort
## numberssort

We can get additional information about the files, such as file type, permissions, size and modification date, using the argument ‘-l’.

ls -l
## total 728
## -rw-rw-r-- 1 natalia natalia   2976 fev  8 17:01 bear.fasta
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 bear.fastq
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 bear.sam
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 butterfly.fasta
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 butterfly.fastq
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 butterfly.sam
## -rw-rw-r-- 1 natalia natalia 667104 fev 17 13:34 CommandLineToolsBioinfo.html
## -rw-rw-r-- 1 natalia natalia   9471 fev 17 13:34 CommandLineToolsBioinfo.Rmd
## -rw-rw-r-- 1 natalia natalia      2 fev 17 13:34 dog.fasta
## -rw-rw-r-- 1 natalia natalia      2 fev 17 13:34 dog.fastq
## -rw-rw-r-- 1 natalia natalia      2 fev 17 13:34 dog.sam
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 fish.fasta
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 fish.fastq
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 fish.sam
## -rw-rw-r-- 1 natalia natalia    115 fev 10 17:11 numbers
## -rw-rw-r-- 1 natalia natalia    103 fev 11 14:34 numbers1
## -rw-rw-r-- 1 natalia natalia    103 fev 17 13:34 numbers1sort
## -rw-rw-r-- 1 natalia natalia    115 fev 17 13:34 numberssort

If we use the argument ‘-lt’, the files are listed by modification time, newest first.

ls -lt
## total 728
## -rw-rw-r-- 1 natalia natalia   9471 fev 17 13:34 CommandLineToolsBioinfo.Rmd
## -rw-rw-r-- 1 natalia natalia 667104 fev 17 13:34 CommandLineToolsBioinfo.html
## -rw-rw-r-- 1 natalia natalia    103 fev 17 13:34 numbers1sort
## -rw-rw-r-- 1 natalia natalia    115 fev 17 13:34 numberssort
## -rw-rw-r-- 1 natalia natalia      2 fev 17 13:34 dog.fasta
## -rw-rw-r-- 1 natalia natalia      2 fev 17 13:34 dog.fastq
## -rw-rw-r-- 1 natalia natalia      2 fev 17 13:34 dog.sam
## -rw-rw-r-- 1 natalia natalia    103 fev 11 14:34 numbers1
## -rw-rw-r-- 1 natalia natalia    115 fev 10 17:11 numbers
## -rw-rw-r-- 1 natalia natalia   2976 fev  8 17:01 bear.fasta
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 bear.fastq
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 butterfly.fastq
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 fish.fastq
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 butterfly.fasta
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 fish.fasta
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 bear.sam
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 butterfly.sam
## -rw-rw-r-- 1 natalia natalia      2 fev  6 17:09 fish.sam

Sometimes it is useful to find common patterns shared in the names of the files in order to work with a large number of files. For example, we can show the files starting with a specific string using the wildcard ‘*’, which matches any sequence of characters.

ls dog*
## dog.fasta
## dog.fastq
## dog.sam

We can also represent a range of characters using ‘[]’.

ls [a-z]*.fasta
## bear.fasta
## butterfly.fasta
## dog.fasta
## fish.fasta

Or search for files starting with a specific pattern:

ls b*.fasta
## bear.fasta
## butterfly.fasta

Or search for more than one file name:

ls {dog,fish}.sam #without space inside the curly brackets
## dog.sam
## fish.sam
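
The patterns above can be tried safely with ‘echo’, which only prints the names the shell expands. This is a minimal sketch on empty files created in a temporary directory; the ‘?’ wildcard (exactly one character) and ‘[!b]’ (any character except ‘b’) are standard shell patterns that were not shown above.

```shell
tmp=$(mktemp -d)
cd "$tmp"
touch bear.fasta bear.sam dog.fasta dog.sam fish.sam

any_fasta=$(echo *.fasta)     # every name ending in .fasta
three_chars=$(echo ???.sam)   # exactly three characters before .sam
not_b=$(echo [!b]*.sam)       # .sam names that do not start with 'b'

echo "$any_fasta"     # bear.fasta dog.fasta
echo "$three_chars"   # dog.sam
echo "$not_b"         # dog.sam fish.sam
```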

Creating and Removing Content

‘mkdir’ - make directory

To create a directory we use the command ‘mkdir’ (make directory).

mkdir dog

If we want to create nested directories in one step, including any missing parent directories, we need to add the argument ‘-p’.

mkdir -p animals/cat
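
To see what ‘-p’ changes, this sketch tries the same nested path with and without it inside a temporary directory. The variable names ‘plain’ and ‘nested’ are only there to record what happened.

```shell
tmp=$(mktemp -d)
cd "$tmp"

# Without -p, a nested path fails because 'animals' does not exist yet.
if mkdir animals/cat 2>/dev/null; then plain=ok; else plain=failed; fi

# With -p, missing parents are created, and running it again is not an error.
mkdir -p animals/cat
mkdir -p animals/cat
[ -d animals/cat ] && nested=created

echo "$plain $nested"   # failed created
```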

‘tree’

‘tree’ can be used to check all the directories and their hierarchy.

tree
## .
## ├── animals
## │   └── cat
## ├── bear.fasta
## ├── bear.fastq
## ├── bear.sam
## ├── butterfly.fasta
## ├── butterfly.fastq
## ├── butterfly.sam
## ├── CommandLineToolsBioinfo.html
## ├── CommandLineToolsBioinfo.Rmd
## ├── dog
## ├── dog.fasta
## ├── dog.fastq
## ├── dog.sam
## ├── fish.fasta
## ├── fish.fastq
## ├── fish.sam
## ├── numbers
## ├── numbers1
## ├── numbers1sort
## └── numberssort
## 
## 3 directories, 18 files

‘cp’ - copy

We can copy a file using ‘cp’ (copy), giving the name of the file and the destination as arguments.

cp dog.sam dog
ls dog
## dog.sam

Multiple files can be copied at once. Here ‘dog.bam’ does not exist, so ‘cp’ reports an error for it and still copies ‘dog.fasta’:

cp dog.fasta dog.bam dog
ls dog
## cp: cannot stat 'dog.bam': No such file or directory
## dog.fasta
## dog.sam

Or all the files matching a pattern can be copied:

cp dog.* dog
ls dog
## dog.fasta
## dog.fastq
## dog.sam
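
Two other common uses of ‘cp’ are copying a whole directory with ‘-r’ (recursive) and copying a file under a new name. A minimal sketch, assuming made-up names such as ‘dog_backup’ and ‘dog_copy.fastq’ and working in a temporary directory:

```shell
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p dog/raw
echo "@read1" > dog/raw/dog.fastq

# -r copies the directory and everything inside it.
cp -r dog dog_backup

# With two file names, cp makes a copy under the new name.
cp dog/raw/dog.fastq dog/raw/dog_copy.fastq

ls dog_backup/raw   # dog.fastq
```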

‘mv’ - move

By using ‘mv’ (move), we can move the content from the current directory to a target directory:

mkdir fish
mv fish.sam fish
mv fish* fish
ls fish
## mv: cannot move 'fish' to a subdirectory of itself, 'fish/fish'
## fish.fasta
## fish.fastq
## fish.sam
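
The error message in the output above appears because ‘fish*’ also matches the directory ‘fish’ itself. One way around it, sketched here in a temporary directory, is a narrower pattern such as ‘fish.*’. The same command also renames files; the name ‘fish_aligned.sam’ is just an example.

```shell
tmp=$(mktemp -d)
cd "$tmp"
touch fish.fasta fish.fastq fish.sam
mkdir fish

# 'fish.*' requires a dot, so the directory 'fish' is not matched.
mv fish.* fish/

# mv also renames: same directory, new name.
mv fish/fish.sam fish/fish_aligned.sam

ls fish
```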

‘rm’ - remove

‘rm’ removes files. To avoid mistakes, you can use the option ‘-i’ (or ‘--interactive’), which asks for confirmation before each removal. To remove a directory together with its subdirectories and files, we can use the option ‘-r’ (recursive); ‘-rf’ does the same in a forced way, without asking for confirmation, so it should be used with care.

rm dog.*

‘rmdir’ can remove empty directories.

rmdir animals/cat
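
The difference between plain ‘rm’ and ‘rm -r’ can be checked safely in a temporary directory. In this sketch the directory ‘scratch’ and its files are invented for the test; remember that removed files do not go to a trash folder.

```shell
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p scratch/sub
touch scratch/sub/a.txt scratch/b.txt

# Plain rm refuses to remove a directory.
if rm scratch 2>/dev/null; then plain=removed; else plain=refused; fi

# -r removes the directory and all of its contents.
rm -r scratch
[ ! -e scratch ] && recursive=gone

echo "$plain $recursive"   # refused gone
```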

Exploring File Content

‘more’

‘more’ shows the file one screen at a time and is meant for moving forward through it.

#more bear.fasta

‘less’

‘less’ allows moving forward and backward in the file. Press ‘q’ to quit.

less bear.fasta
## >seq1
## AGTNCTGCNACGANTNGACTCAGNTCGTNAGNTCACATNGCTANGNGCATCNTGANCGTACGATNNATGCGCNATCATGNNCGATTCANGACTNGGNACTNTACGACGNTCNAGTNGTACAGCNTCGTANGCTNACTGNATANCGNTCAGATNCGCAGTNGCNTANCAGTGATCNGTANCGANTCGCATNGACNTCGNTAACNGTCNTAGTGANCAGNCTGCTANANTGCGANCTTNAGCACNTGATGCNGNCATCTGANCNGATNTGACNTGCAACGTNNCATGGACTNATCGNGTNACGTNCATCGANNGATCATNGCGTCANCNATGTACNGATGNCNTAGCGNCTATANGCGTACNGTCNANACGTTNGACNCTAGCANTGCNGTAANGTCTGNCATNACGNCTGAGCANTGATNCTACGNTNCGATNCAGTGNACCGNATGNTACTCGNAANCGTACTGNTCAGNNACTGANGCTAGTCNNAGCTCANGTATCNGAGNTCNGCTACTNAGNTCGACTAGNCTNGAANTCGTGCANNATCGTCNAGGNATCNGTCAANCTGTGACNTAGNCTCNGANAGTCTNGCAAGCTNTAGCN
## >seq2
## TATNATAGCCCACATTTCCGNAATCCNCATATGCCAACANTTGTGCTNTATAACCNCTAATCAGTCTGATACCCNTAACCNTATTGACCTAACTGNCATTNCTACAGTCACAACCTNAGTTANGCCTACATTTNCACTTACGATACCCGTANATCATAATTNCCGATTCGANCATCACCAGCTNTTACGTCAACANTTATAACTTNGCCCACCTGNTAATTTAANCGTACCATCCGAACTNTATGCANCCTATCTCACATANGTCNCATAGTATCCCGATTCNAATCNCTAAGCTATCAGCNCTTAATAACTGTCATCNACTANACTTGCACACCTATTNGNTGACCCAATTTCTACATANGCGTTCANCCATATCCTCTNGAAACAANGCCTTTACTAAGTATCCNATTCCNAAGTCGNTACTATACCTCTNAAACCGTACCTGCTNAATCACTCGANTTACTCATCTAANGTACTCGATNCAAGCCTTACNATGTTCCANCTAATCGCAATTCNATCTCCGNAAATAGCCCAATTTNCTTNAACGCATTCAGNCACATTGCCCNAATTTATCAANTATGCCNATCCGATATCANCTTCAGTCAGCATCTAACTNTACNTCTAAGCANATTGCACCTCAANCTCTGATTACNATTCGACATACTTGANCCTGCATCACATNACCTNAGCTATTCACTAATGNCATGACCTTACNACCTTANTAGCAACCNCTATGTAAGATTNTCCCTAGCAATCCTNAACTCATNGCTCATTAANTCGCTTACTANCACGTAATACTCCNGTACCTNATGACTAGNATATCCCATGCTNACACTTTCNAATACGCATCGCTACTNATACACGTATCNTATCNGCTACATTTCAAGCANCAGATCCNTTACCNTTCTCAAGANGCACATATCTCATNTGTCACATNCAATACCGTNTTGAACACTCACANTTTACCGTTCCCANATGAACTNTGCTCAATCCCAANATGTCAGTTCCATANTCACNAAGTCTCAGNTTACTCATACACCGNTATATTCNACACGTTCNATACTGACATCANTACTCGCTGACCTTAANTAACCTNGAC
## >seq3
## NTTAGACACNCGATATCATGAATNCCANAAAGCTTCCAGTTANCAGAANTTCCATNACCAATGTAACNCTGATGANCAACTTANCACTGATAAGCACNTATAAGCCNTATACGANCTTCTACGAANNCACAGTTACCANTAATGGCANCATATTCCGANATATAANACCTGTAANGCACTTCTACAGANACNTAAGCTAAGTANTCCTNAACACGTTATNAACGCGTNCAATACATAGCATCNATANGACTCNAAATTCCGNATACCAGTCCTAGTNAAACAATNCGTAGTCATCANGCCTAAANTNGCCATATATACGCNTAAGAATCCANTACAAGNTTCNATCTAGACCTGANATACAGCNATTACTATNGAACCATACTCNAGGATCAANTCNAGAACTCTAAGCTCATNANCTACGTACGATCTAANCTCAANGATACCAGTATNACCATGTNATTACACNGACTAANACGTCNAAAGTCTATANCCGTAATAGCCATNANTACCATGTTCAGCANAACTACATNGTCANCAGATTCNCAAATGATCAATNGCNCTAATGCAATCTACGANATAAGTCCNNCACATGATACCAATTNGANGACTACTCANTGATCACTACAGNATCANGAATTCCANGTACATCACAGNTTAACTNACAGTTACAATGNCCGTTCAANACACAANTTGCTCNATGAATAAACGTCNGTAANCTACCGTAANTCAAANTCGATCATACACTGNNCACATAGTAAGNTATCCCTANACATGGACTAACTNCACATATGNNAATCTGCACNCTAGATAGCTTACAANAACTTCANGTNAGCTAACTATCNCAGAGAACTATNCCGATNTACAAAACTNGCTTCNCGTAAACGANTACATCTAACNTAGNATCAGTACAATATNCCGCTCNAAGTACTAGTCANATACATGACNTATCAANGCACATGTANCCTGCTNAAACCTATAANGCTTAAACNGTTCAACNGAGAACCANTTNAATAGCTCNTTCCGAAAANCTAACGTAGTACNTACATGTACCANTCGTANCAATAGTNCCAATACANTGCAACAACGNTTCNAATAGCTTNAACTACGAAATNGCCTNTTAACGCANAGTACTCATNAAGTACCCGNACTATANGAACTTACCTAACGTNAATNCAGACTAGCTCANTATAAGATCNCTCAGAANTCACNAGTTACCANGATCTAACCTNTAAGTACTGAACNAGTNAATCCCNATC

Multi-Fasta File

Note that a FASTA file has a specific organization. Each record starts with a header line, which begins with ‘>’ followed by the name of the sequence. The sequence itself comes next, in this case a DNA sequence in which the letters A, T, C and G represent the nucleotides and N marks an unknown base. Each sequence has its own header, and a file with several sequences is called a multi-FASTA file.
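
Because every record has exactly one header line, counting the lines that start with ‘>’ counts the sequences. The sketch below uses a tiny made-up file, ‘example.fasta’, together with ‘grep’ and ‘cut’, which are described later in this post.

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf '>seq1\nACGTN\n>seq2\nTTGCA\n>seq3\nNNACG\n' > example.fasta

# '^>' matches lines that begin with '>'; -c counts them.
n_seqs=$(grep -c '^>' example.fasta)
echo "$n_seqs"    # 3

# List the sequence names without the leading '>'.
names=$(grep '^>' example.fasta | cut -c 2-)
echo "$names"
```

The pattern is quoted so that the shell does not treat ‘>’ as a redirection.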

‘tail’

‘tail’ shows the last lines of a file (10 by default), just as ‘head’ shows the first ones.
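
A short sketch of ‘head’ and ‘tail’ with the ‘-n’ option, which sets how many lines are shown. The file ‘lines.txt’ is created only for this example.

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'one\ntwo\nthree\nfour\nfive\n' > lines.txt

first_two=$(head -n 2 lines.txt)   # first two lines
last_two=$(tail -n 2 lines.txt)    # last two lines

# 'tail -n +2' starts at line 2, which is handy for skipping a header line.
second=$(tail -n +2 lines.txt | head -n 1)

echo "$first_two"
echo "$last_two"
echo "$second"      # two
```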

‘cat’ - concatenate

‘cat’ can be used to show the content of one file.

cat bear.fasta
## >seq1
## AGTNCTGCNACGANTNGACTCAGNTCGTNAGNTCACATNGCTANGNGCATCNTGANCGTACGATNNATGCGCNATCATGNNCGATTCANGACTNGGNACTNTACGACGNTCNAGTNGTACAGCNTCGTANGCTNACTGNATANCGNTCAGATNCGCAGTNGCNTANCAGTGATCNGTANCGANTCGCATNGACNTCGNTAACNGTCNTAGTGANCAGNCTGCTANANTGCGANCTTNAGCACNTGATGCNGNCATCTGANCNGATNTGACNTGCAACGTNNCATGGACTNATCGNGTNACGTNCATCGANNGATCATNGCGTCANCNATGTACNGATGNCNTAGCGNCTATANGCGTACNGTCNANACGTTNGACNCTAGCANTGCNGTAANGTCTGNCATNACGNCTGAGCANTGATNCTACGNTNCGATNCAGTGNACCGNATGNTACTCGNAANCGTACTGNTCAGNNACTGANGCTAGTCNNAGCTCANGTATCNGAGNTCNGCTACTNAGNTCGACTAGNCTNGAANTCGTGCANNATCGTCNAGGNATCNGTCAANCTGTGACNTAGNCTCNGANAGTCTNGCAAGCTNTAGCN
## >seq2
## TATNATAGCCCACATTTCCGNAATCCNCATATGCCAACANTTGTGCTNTATAACCNCTAATCAGTCTGATACCCNTAACCNTATTGACCTAACTGNCATTNCTACAGTCACAACCTNAGTTANGCCTACATTTNCACTTACGATACCCGTANATCATAATTNCCGATTCGANCATCACCAGCTNTTACGTCAACANTTATAACTTNGCCCACCTGNTAATTTAANCGTACCATCCGAACTNTATGCANCCTATCTCACATANGTCNCATAGTATCCCGATTCNAATCNCTAAGCTATCAGCNCTTAATAACTGTCATCNACTANACTTGCACACCTATTNGNTGACCCAATTTCTACATANGCGTTCANCCATATCCTCTNGAAACAANGCCTTTACTAAGTATCCNATTCCNAAGTCGNTACTATACCTCTNAAACCGTACCTGCTNAATCACTCGANTTACTCATCTAANGTACTCGATNCAAGCCTTACNATGTTCCANCTAATCGCAATTCNATCTCCGNAAATAGCCCAATTTNCTTNAACGCATTCAGNCACATTGCCCNAATTTATCAANTATGCCNATCCGATATCANCTTCAGTCAGCATCTAACTNTACNTCTAAGCANATTGCACCTCAANCTCTGATTACNATTCGACATACTTGANCCTGCATCACATNACCTNAGCTATTCACTAATGNCATGACCTTACNACCTTANTAGCAACCNCTATGTAAGATTNTCCCTAGCAATCCTNAACTCATNGCTCATTAANTCGCTTACTANCACGTAATACTCCNGTACCTNATGACTAGNATATCCCATGCTNACACTTTCNAATACGCATCGCTACTNATACACGTATCNTATCNGCTACATTTCAAGCANCAGATCCNTTACCNTTCTCAAGANGCACATATCTCATNTGTCACATNCAATACCGTNTTGAACACTCACANTTTACCGTTCCCANATGAACTNTGCTCAATCCCAANATGTCAGTTCCATANTCACNAAGTCTCAGNTTACTCATACACCGNTATATTCNACACGTTCNATACTGACATCANTACTCGCTGACCTTAANTAACCTNGAC
## >seq3
## NTTAGACACNCGATATCATGAATNCCANAAAGCTTCCAGTTANCAGAANTTCCATNACCAATGTAACNCTGATGANCAACTTANCACTGATAAGCACNTATAAGCCNTATACGANCTTCTACGAANNCACAGTTACCANTAATGGCANCATATTCCGANATATAANACCTGTAANGCACTTCTACAGANACNTAAGCTAAGTANTCCTNAACACGTTATNAACGCGTNCAATACATAGCATCNATANGACTCNAAATTCCGNATACCAGTCCTAGTNAAACAATNCGTAGTCATCANGCCTAAANTNGCCATATATACGCNTAAGAATCCANTACAAGNTTCNATCTAGACCTGANATACAGCNATTACTATNGAACCATACTCNAGGATCAANTCNAGAACTCTAAGCTCATNANCTACGTACGATCTAANCTCAANGATACCAGTATNACCATGTNATTACACNGACTAANACGTCNAAAGTCTATANCCGTAATAGCCATNANTACCATGTTCAGCANAACTACATNGTCANCAGATTCNCAAATGATCAATNGCNCTAATGCAATCTACGANATAAGTCCNNCACATGATACCAATTNGANGACTACTCANTGATCACTACAGNATCANGAATTCCANGTACATCACAGNTTAACTNACAGTTACAATGNCCGTTCAANACACAANTTGCTCNATGAATAAACGTCNGTAANCTACCGTAANTCAAANTCGATCATACACTGNNCACATAGTAAGNTATCCCTANACATGGACTAACTNCACATATGNNAATCTGCACNCTAGATAGCTTACAANAACTTCANGTNAGCTAACTATCNCAGAGAACTATNCCGATNTACAAAACTNGCTTCNCGTAAACGANTACATCTAACNTAGNATCAGTACAATATNCCGCTCNAAGTACTAGTCANATACATGACNTATCAANGCACATGTANCCTGCTNAAACCTATAANGCTTAAACNGTTCAACNGAGAACCANTTNAATAGCTCNTTCCGAAAANCTAACGTAGTACNTACATGTACCANTCGTANCAATAGTNCCAATACANTGCAACAACGNTTCNAATAGCTTNAACTACGAAATNGCCTNTTAACGCANAGTACTCATNAAGTACCCGNACTATANGAACTTACCTAACGTNAATNCAGACTAGCTCANTATAAGATCNCTCAGAANTCACNAGTTACCANGATCTAACCTNTAAGTACTGAACNAGTNAATCCCNATC

Or concatenate multiple files:

cat b*.fasta
## >seq1
## AGTNCTGCNACGANTNGACTCAGNTCGTNAGNTCACATNGCTANGNGCATCNTGANCGTACGATNNATGCGCNATCATGNNCGATTCANGACTNGGNACTNTACGACGNTCNAGTNGTACAGCNTCGTANGCTNACTGNATANCGNTCAGATNCGCAGTNGCNTANCAGTGATCNGTANCGANTCGCATNGACNTCGNTAACNGTCNTAGTGANCAGNCTGCTANANTGCGANCTTNAGCACNTGATGCNGNCATCTGANCNGATNTGACNTGCAACGTNNCATGGACTNATCGNGTNACGTNCATCGANNGATCATNGCGTCANCNATGTACNGATGNCNTAGCGNCTATANGCGTACNGTCNANACGTTNGACNCTAGCANTGCNGTAANGTCTGNCATNACGNCTGAGCANTGATNCTACGNTNCGATNCAGTGNACCGNATGNTACTCGNAANCGTACTGNTCAGNNACTGANGCTAGTCNNAGCTCANGTATCNGAGNTCNGCTACTNAGNTCGACTAGNCTNGAANTCGTGCANNATCGTCNAGGNATCNGTCAANCTGTGACNTAGNCTCNGANAGTCTNGCAAGCTNTAGCN
## >seq2
## TATNATAGCCCACATTTCCGNAATCCNCATATGCCAACANTTGTGCTNTATAACCNCTAATCAGTCTGATACCCNTAACCNTATTGACCTAACTGNCATTNCTACAGTCACAACCTNAGTTANGCCTACATTTNCACTTACGATACCCGTANATCATAATTNCCGATTCGANCATCACCAGCTNTTACGTCAACANTTATAACTTNGCCCACCTGNTAATTTAANCGTACCATCCGAACTNTATGCANCCTATCTCACATANGTCNCATAGTATCCCGATTCNAATCNCTAAGCTATCAGCNCTTAATAACTGTCATCNACTANACTTGCACACCTATTNGNTGACCCAATTTCTACATANGCGTTCANCCATATCCTCTNGAAACAANGCCTTTACTAAGTATCCNATTCCNAAGTCGNTACTATACCTCTNAAACCGTACCTGCTNAATCACTCGANTTACTCATCTAANGTACTCGATNCAAGCCTTACNATGTTCCANCTAATCGCAATTCNATCTCCGNAAATAGCCCAATTTNCTTNAACGCATTCAGNCACATTGCCCNAATTTATCAANTATGCCNATCCGATATCANCTTCAGTCAGCATCTAACTNTACNTCTAAGCANATTGCACCTCAANCTCTGATTACNATTCGACATACTTGANCCTGCATCACATNACCTNAGCTATTCACTAATGNCATGACCTTACNACCTTANTAGCAACCNCTATGTAAGATTNTCCCTAGCAATCCTNAACTCATNGCTCATTAANTCGCTTACTANCACGTAATACTCCNGTACCTNATGACTAGNATATCCCATGCTNACACTTTCNAATACGCATCGCTACTNATACACGTATCNTATCNGCTACATTTCAAGCANCAGATCCNTTACCNTTCTCAAGANGCACATATCTCATNTGTCACATNCAATACCGTNTTGAACACTCACANTTTACCGTTCCCANATGAACTNTGCTCAATCCCAANATGTCAGTTCCATANTCACNAAGTCTCAGNTTACTCATACACCGNTATATTCNACACGTTCNATACTGACATCANTACTCGCTGACCTTAANTAACCTNGAC
## >seq3
## NTTAGACACNCGATATCATGAATNCCANAAAGCTTCCAGTTANCAGAANTTCCATNACCAATGTAACNCTGATGANCAACTTANCACTGATAAGCACNTATAAGCCNTATACGANCTTCTACGAANNCACAGTTACCANTAATGGCANCATATTCCGANATATAANACCTGTAANGCACTTCTACAGANACNTAAGCTAAGTANTCCTNAACACGTTATNAACGCGTNCAATACATAGCATCNATANGACTCNAAATTCCGNATACCAGTCCTAGTNAAACAATNCGTAGTCATCANGCCTAAANTNGCCATATATACGCNTAAGAATCCANTACAAGNTTCNATCTAGACCTGANATACAGCNATTACTATNGAACCATACTCNAGGATCAANTCNAGAACTCTAAGCTCATNANCTACGTACGATCTAANCTCAANGATACCAGTATNACCATGTNATTACACNGACTAANACGTCNAAAGTCTATANCCGTAATAGCCATNANTACCATGTTCAGCANAACTACATNGTCANCAGATTCNCAAATGATCAATNGCNCTAATGCAATCTACGANATAAGTCCNNCACATGATACCAATTNGANGACTACTCANTGATCACTACAGNATCANGAATTCCANGTACATCACAGNTTAACTNACAGTTACAATGNCCGTTCAANACACAANTTGCTCNATGAATAAACGTCNGTAANCTACCGTAANTCAAANTCGATCATACACTGNNCACATAGTAAGNTATCCCTANACATGGACTAACTNCACATATGNNAATCTGCACNCTAGATAGCTTACAANAACTTCANGTNAGCTAACTATCNCAGAGAACTATNCCGATNTACAAAACTNGCTTCNCGTAAACGANTACATCTAACNTAGNATCAGTACAATATNCCGCTCNAAGTACTAGTCANATACATGACNTATCAANGCACATGTANCCTGCTNAAACCTATAANGCTTAAACNGTTCAACNGAGAACCANTTNAATAGCTCNTTCCGAAAANCTAACGTAGTACNTACATGTACCANTCGTANCAATAGTNCCAATACANTGCAACAACGNTTCNAATAGCTTNAACTACGAAATNGCCTNTTAACGCANAGTACTCATNAAGTACCCGNACTATANGAACTTACCTAACGTNAATNCAGACTAGCTCANTATAAGATCNCTCAGAANTCACNAGTTACCANGATCTAACCTNTAAGTACTGAACNAGTNAATCCCNATC
## 

‘wc’ - word count

‘wc’ (word count) gives us three numbers followed by the name of the file. The first is the number of lines in the file, the second is the number of words and the third is the number of bytes (the same as the number of characters in a plain ASCII file).

wc bear.fasta
##    6    6 2976 bear.fasta

If we use ‘-l’ we obtain only the number of lines in the file. ‘-c’ counts the bytes (‘-m’ counts characters) and ‘-w’ counts the words.

wc -l bear.fasta
wc -c bear.fasta
wc -w bear.fasta
## 6 bear.fasta
## 2976 bear.fasta
## 6 bear.fasta
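
In a FASTA file, ‘wc -c’ on the whole file also counts the headers and the line breaks. One possible way to count only the bases is to drop the header lines and the newlines first. This is a sketch on a small invented file, not a replacement for dedicated tools.

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf '>seq1\nACGTN\n>seq2\nTTGCA\n' > example.fasta

# Reading from standard input makes wc print the number without the file name.
lines=$(wc -l < example.fasta)

# Drop header lines, delete the newlines, then count what is left.
bases=$(grep -v '^>' example.fasta | tr -d '\n' | wc -c)

echo "lines: $lines, bases: $bases"
```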

‘<’, ‘>’, ‘|’

We can use the symbol ‘<’ to send the content of a file to a command as input, and ‘>’ to save the output of a command in a file. We can also chain commands by using ‘|’ (pipe), which passes the output of one command as input to the next.

ls | wc -l #Here we can see how many files we have in this directory
## 15
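
The three symbols can be combined. The sketch below, using invented file names in a temporary directory, also shows ‘>>’, which appends to a file instead of overwriting it.

```shell
tmp=$(mktemp -d)
cd "$tmp"

printf 'b\na\nc\n' > letters.txt      # '>' creates or overwrites the file
sort < letters.txt > sorted.txt       # '<' feeds the file to sort as input
echo d >> sorted.txt                  # '>>' appends a line at the end

first=$(sort < letters.txt | head -n 1)   # '|' passes sort's output to head
echo "$first"     # a
cat sorted.txt
```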

Querying Content

‘cut’

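We can use the command ‘cut’ to extract selected fields (columns) from each line. Fields are separated by tabs by default; we can change the delimiter with ‘-d’ and choose the fields with ‘-f’. For example, using a space as the delimiter, we can select columns 1 and 2: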

cut -d ' ' -f1,2 numbers
## one 1
## two 2
## three 3
## four 4
## five 5
## six 6
## seven 7
## eight 8
## nine 9
## ten 10
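
Many bioinformatics formats are tab-separated, so the default delimiter is often the right one. A minimal sketch with a made-up four-column file, ‘regions.tsv’:

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'chr1\t100\t200\tgeneA\nchr2\t300\t400\tgeneB\n' > regions.tsv

# Tab is the default delimiter, so -d is not needed.
names=$(cut -f 4 regions.tsv)

# A range such as 1-3 selects consecutive fields.
coords=$(cut -f 1-3 regions.tsv | head -n 1)

echo "$names"
echo "$coords"
```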

‘sort’

‘sort’ prints the lines of the file in alphabetical order; the file itself is not changed.

#more numbers
sort numbers
## eight 8 pair
## five 5 odd
## four 4 pair
## nine 9 odd
## one 1 odd
## seven 7 odd
## six 6 pair
## ten 10 pair
## three 3 odd
## two 2 pair

We can use ‘-r’ to sort it by reverse alphabetical order.

sort -r numbers
## two 2 pair
## three 3 odd
## ten 10 pair
## six 6 pair
## seven 7 odd
## one 1 odd
## nine 9 odd
## four 4 pair
## five 5 odd
## eight 8 pair

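We can sort by a column by specifying it with ‘-k’. Note that ‘-k2’ sorts on the text from column 2 to the end of the line, and the comparison is alphabetical, not numerical, so ‘10’ is not placed after ‘9’.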

sort -k2 numbers
## ten 10 pair
## one 1 odd
## two 2 pair
## three 3 odd
## four 4 pair
## five 5 odd
## six 6 pair
## seven 7 odd
## eight 8 pair
## nine 9 odd

If we want to sort in numerical order, we need to add ‘n’ to the key.

sort -k 2n numbers
## one 1 odd
## two 2 pair
## three 3 odd
## four 4 pair
## five 5 odd
## six 6 pair
## seven 7 odd
## eight 8 pair
## nine 9 odd
## ten 10 pair

Reverse:

sort -k 2nr numbers
## ten 10 pair
## nine 9 odd
## eight 8 pair
## seven 7 odd
## six 6 pair
## five 5 odd
## four 4 pair
## three 3 odd
## two 2 pair
## one 1 odd

Using more than one column:

sort -k 3 -k 2n numbers
## one 1 odd
## three 3 odd
## five 5 odd
## seven 7 odd
## nine 9 odd
## two 2 pair
## four 4 pair
## six 6 pair
## eight 8 pair
## ten 10 pair
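
A key written as ‘-k 3’ runs from column 3 to the end of the line. When several keys are combined, it can be clearer to give each key a start and an end, as in ‘-k3,3 -k2,2n’. A sketch on a small invented file:

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'ten 10 pair\ntwo 2 pair\nnine 9 odd\none 1 odd\n' > nums.txt

# -k3,3 : first key is column 3 only (alphabetical)
# -k2,2n: ties are broken by column 2 only, compared as numbers
sorted=$(sort -k3,3 -k2,2n nums.txt)
echo "$sorted"
```

With this input the ‘odd’ lines come first (1, then 9), followed by the ‘pair’ lines (2, then 10).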

We can use the argument ‘-u’ to keep only the unique values.

cut -d ' ' -f3 numbers | sort -u
## odd
## pair

‘uniq’

This command can also be used to get the unique values, but it only compares adjacent lines. If the input is not sorted, repeated values that are not next to each other are shown again.

cut -d ' ' -f3 numbers | uniq
## odd
## pair
## odd
## pair
## odd
## pair
## odd
## pair
## odd
## pair

In order to get the unique values, we need to sort it first.

cut -d ' ' -f3 numbers | sort | uniq
## odd
## pair

But we can use it to count the number of consecutive times a word appears by using ‘-c’.

cut -d ' ' -f3 numbers | uniq -c
##       1 odd
##       1 pair
##       1 odd
##       1 pair
##       1 odd
##       1 pair
##       1 odd
##       1 pair
##       1 odd
##       1 pair
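
Sorting before ‘uniq -c’ puts equal lines next to each other, so the counts become totals per value. A sketch with an invented one-column file:

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'odd\npair\nodd\npair\nodd\n' > parity.txt

# Sort first so that identical lines are adjacent, then count them.
counts=$(sort parity.txt | uniq -c)
echo "$counts"

# A second, numeric reverse sort puts the most frequent value on top.
most=$(sort parity.txt | uniq -c | sort -rn | head -n 1)
echo "$most"
```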

‘grep’

We can use ‘grep’ to search for a pattern in one or more files; it shows every line that contains the pattern. It searches only the files given as arguments (the option ‘-r’ searches a whole directory).

grep odd numbers
## one 1 odd
## three 3 odd
## five 5 odd
## seven 7 odd
## nine 9 odd

If the pattern contains spaces, we need to enclose it in quotes. If we want to see the line number of each match, we use ‘-n’.

grep -n "3 odd" numbers
## 3:three 3 odd
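
A few other standard ‘grep’ options are often useful: ‘-c’ counts matching lines, ‘-v’ inverts the match and ‘-i’ ignores case. A sketch on a small invented file:

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'one 1 odd\ntwo 2 pair\nthree 3 odd\nFour 4 pair\n' > nums.txt

n_odd=$(grep -c odd nums.txt)        # number of lines containing 'odd'
n_other=$(grep -vc odd nums.txt)     # number of lines NOT containing 'odd'
four=$(grep -i four nums.txt)        # case-insensitive match

echo "$n_odd $n_other"   # 2 2
echo "$four"             # Four 4 pair
```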

Comparing Content

‘diff’

This command is used to compare files line by line. Its report shows what has to be done to make the files identical (a - add, c - change and d - delete), the differing lines of the first file (marked with ‘<’) and the differing lines of the second file (marked with ‘>’). We can use ‘-i’ to ignore differences between upper and lower case; ‘-s’ to report when the files are identical; ‘-w’ to ignore tabs and spaces.

diff numbers1 numbers

‘sdiff’ does the same comparison but prints the two files side by side. The argument ‘-s’ is used to hide identical lines.

sdiff -s numbers1 numbers

The report shows all the differences between the two files. So, to make the files identical, we need to insert lines 3 and 4 of the second file at line 3 of the first, and replace lines 7 and 8 with lines 8 and 9.
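
To make the notation concrete, here is a sketch with two tiny invented files in which the second one has one extra line. ‘diff’ exits with a non-zero status when the files differ, hence the ‘|| true’.

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'one\ntwo\nfour\n' > a.txt
printf 'one\ntwo\nthree\nfour\n' > b.txt

report=$(diff a.txt b.txt || true)
echo "$report"
# 2a3
# > three
```

‘2a3’ reads: after line 2 of the first file, add what is line 3 of the second file. The line marked with ‘>’ is the text that comes from the second file.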

‘comm’

‘comm’ shows three columns: 1) lines found only in the first file; 2) lines found only in the second file; 3) lines found in both files. The files must be sorted. We can use ‘-1’, ‘-2’ and ‘-3’ to hide the columns we don’t need. In this example, we show only what is common to both files:

sort numbers  > numberssort
sort numbers1 > numbers1sort
comm -1 -2 numberssort numbers1sort
## five 5 odd
## one 1 odd
## seven 7 odd
## six 6 pair
## ten 10 pair
## two 2 pair
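
Hiding other combinations of columns gives the lines that are exclusive to one file. A sketch with two small, already sorted, invented files:

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'five\none\nten\n' > a.sorted
printf 'one\nten\ntwo\n' > b.sorted

only_a=$(comm -23 a.sorted b.sorted)   # hide columns 2 and 3: only in the first file
only_b=$(comm -13 a.sorted b.sorted)   # hide columns 1 and 3: only in the second file
both=$(comm -12 a.sorted b.sorted)     # hide columns 1 and 2: in both files

echo "$only_a"   # five
echo "$only_b"   # two
echo "$both"
```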

Archiving Content

‘gzip’, ‘bzip2’

We can use ‘gzip’ to compress a file. ‘bzip2’ usually achieves a higher compression ratio, at the cost of speed.

gzip bear.fasta

‘gunzip’, ‘bunzip2’

These commands decompress files created by ‘gzip’ and ‘bzip2’, respectively.

gunzip bear.fasta.gz
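
By default ‘gzip’ replaces the original file with the compressed one. If the original should be kept, one option is ‘-c’, which writes to standard output so the result can be redirected to a new file. A sketch with an invented file, including a check that the round trip gives back the same content:

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf '>seq1\nACGTACGTACGT\n' > example.fasta

gzip -c example.fasta > example.fasta.gz      # original file is kept
gunzip -c example.fasta.gz > roundtrip.fasta  # decompress to a new file

# cmp -s is silent and succeeds only if the two files are identical.
if cmp -s example.fasta roundtrip.fasta; then same=yes; else same=no; fi
echo "$same"   # yes
```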

‘tar’

If we want to compress several files together at once, we first need to bundle them into a single archive with ‘tar’, which can then be compressed. Option ‘c’ creates the archive, ‘v’ (verbose) lists the files as they are added and ‘f’ indicates the name of the archive file.

tar -cvf filescompressed.tar bear.fasta numbers numbers1
bzip2 filescompressed.tar
## bear.fasta
## numbers
## numbers1

To restore the files, we first need to decompress the archive with ‘bunzip2’ and then use ‘tar’ with the options ‘x’ indicating the extraction, ‘v’ to list the files as they are extracted, and ‘f’ to indicate from which file.

bunzip2 filescompressed.tar.bz2
mkdir filescompressed
mv filescompressed.tar filescompressed
cd filescompressed
tar -xvf filescompressed.tar
## bear.fasta
## numbers
## numbers1
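
‘tar’ can also call the compressor itself: the option ‘z’ runs the archive through gzip, so creating and extracting take one command each. The sketch below uses invented names (‘data’, ‘restored’) in a temporary directory; ‘t’ lists the content without extracting and ‘-C’ chooses where to extract.

```shell
tmp=$(mktemp -d)
cd "$tmp"
mkdir data
echo "first"  > data/a.txt
echo "second" > data/b.txt

tar -czf data.tar.gz data           # c: create, z: gzip, f: archive name
listing=$(tar -tzf data.tar.gz)     # t: list the content only
echo "$listing"

mkdir restored
tar -xzf data.tar.gz -C restored    # x: extract, -C: into this directory
cat restored/data/a.txt             # first
```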

‘zcat’

‘zcat’ allows us to look inside a compressed file without decompressing it on disk.

cd filescompressed
gzip filescompressed.tar
zcat filescompressed.tar.gz | head
## bear.fasta
## AGTNCTGCNACGANTNGACTCAGNTCGTNAGNTCACATNGCTANGNGCATCNTGANCGTACGATNNATGCGCNATCATGNNCGATTCANGACTNGGNACTNTACGACGNTCNAGTNGTACAGCNTCGTANGCTNACTGNATANCGNTCAGATNCGCAGTNGCNTANCAGTGATCNGTANCGANTCGCATNGACNTCGNTAACNGTCNTAGTGANCAGNCTGCTANANTGCGANCTTNAGCACNTGATGCNGNCATCTGANCNGATNTGACNTGCAACGTNNCATGGACTNATCGNGTNACGTNCATCGANNGATCATNGCGTCANCNATGTACNGATGNCNTAGCGNCTATANGCGTACNGTCNANACGTTNGACNCTAGCANTGCNGTAANGTCTGNCATNACGNCTGAGCANTGATNCTACGNTNCGATNCAGTGNACCGNATGNTACTCGNAANCGTACTGNTCAGNNACTGANGCTAGTCNNAGCTCANGTATCNGAGNTCNGCTACTNAGNTCGACTAGNCTNGAANTCGTGCANNATCGTCNAGGNATCNGTCAANCTGTGACNTAGNCTCNGANAGTCTNGCAAGCTNTAGCN
## >seq2
## TATNATAGCCCACATTTCCGNAATCCNCATATGCCAACANTTGTGCTNTATAACCNCTAATCAGTCTGATACCCNTAACCNTATTGACCTAACTGNCATTNCTACAGTCACAACCTNAGTTANGCCTACATTTNCACTTACGATACCCGTANATCATAATTNCCGATTCGANCATCACCAGCTNTTACGTCAACANTTATAACTTNGCCCACCTGNTAATTTAANCGTACCATCCGAACTNTATGCANCCTATCTCACATANGTCNCATAGTATCCCGATTCNAATCNCTAAGCTATCAGCNCTTAATAACTGTCATCNACTANACTTGCACACCTATTNGNTGACCCAATTTCTACATANGCGTTCANCCATATCCTCTNGAAACAANGCCTTTACTAAGTATCCNATTCCNAAGTCGNTACTATACCTCTNAAACCGTACCTGCTNAATCACTCGANTTACTCATCTAANGTACTCGATNCAAGCCTTACNATGTTCCANCTAATCGCAATTCNATCTCCGNAAATAGCCCAATTTNCTTNAACGCATTCAGNCACATTGCCCNAATTTATCAANTATGCCNATCCGATATCANCTTCAGTCAGCATCTAACTNTACNTCTAAGCANATTGCACCTCAANCTCTGATTACNATTCGACATACTTGANCCTGCATCACATNACCTNAGCTATTCACTAATGNCATGACCTTACNACCTTANTAGCAACCNCTATGTAAGATTNTCCCTAGCAATCCTNAACTCATNGCTCATTAANTCGCTTACTANCACGTAATACTCCNGTACCTNATGACTAGNATATCCCATGCTNACACTTTCNAATACGCATCGCTACTNATACACGTATCNTATCNGCTACATTTCAAGCANCAGATCCNTTACCNTTCTCAAGANGCACATATCTCATNTGTCACATNCAATACCGTNTTGAACACTCACANTTTACCGTTCCCANATGAACTNTGCTCAATCCCAANATGTCAGTTCCATANTCACNAAGTCTCAGNTTACTCATACACCGNTATATTCNACACGTTCNATACTGACATCANTACTCGCTGACCTTAANTAACCTNGAC
## >seq3
## NTTAGACACNCGATATCATGAATNCCANAAAGCTTCCAGTTANCAGAANTTCCATNACCAATGTAACNCTGATGANCAACTTANCACTGATAAGCACNTATAAGCCNTATACGANCTTCTACGAANNCACAGTTACCANTAATGGCANCATATTCCGANATATAANACCTGTAANGCACTTCTACAGANACNTAAGCTAAGTANTCCTNAACACGTTATNAACGCGTNCAATACATAGCATCNATANGACTCNAAATTCCGNATACCAGTCCTAGTNAAACAATNCGTAGTCATCANGCCTAAANTNGCCATATATACGCNTAAGAATCCANTACAAGNTTCNATCTAGACCTGANATACAGCNATTACTATNGAACCATACTCNAGGATCAANTCNAGAACTCTAAGCTCATNANCTACGTACGATCTAANCTCAANGATACCAGTATNACCATGTNATTACACNGACTAANACGTCNAAAGTCTATANCCGTAATAGCCATNANTACCATGTTCAGCANAACTACATNGTCANCAGATTCNCAAATGATCAATNGCNCTAATGCAATCTACGANATAAGTCCNNCACATGATACCAATTNGANGACTACTCANTGATCACTACAGNATCANGAATTCCANGTACATCACAGNTTAACTNACAGTTACAATGNCCGTTCAANACACAANTTGCTCNATGAATAAACGTCNGTAANCTACCGTAANTCAAANTCGATCATACACTGNNCACATAGTAAGNTATCCCTANACATGGACTAACTNCACATATGNNAATCTGCACNCTAGATAGCTTACAANAACTTCANGTNAGCTAACTATCNCAGAGAACTATNCCGATNTACAAAACTNGCTTCNCGTAAACGANTACATCTAACNTAGNATCAGTACAATATNCCGCTCNAAGTACTAGTCANATACATGACNTATCAANGCACATGTANCCTGCTNAAACCTATAANGCTTAAACNGTTCAACNGAGAACCANTTNAATAGCTCNTTCCGAAAANCTAACGTAGTACNTACATGTACCANTCGTANCAATAGTNCCAATACANTGCAACAACGNTTCNAATAGCTTNAACTACGAAATNGCCTNTTAACGCANAGTACTCATNAAGTACCCGNACTATANGAACTTACCTAACGTNAATNCAGACTAGCTCANTATAAGATCNCTCAGAANTCACNAGTTACCANGATCTAACCTNTAAGTACTGAACNAGTNAATCCCNATC
## 
## two 2 pair
## three 3 odd
## four 4 pair
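
‘zcat’ is most readable on a single compressed text file, because a compressed ‘tar’ archive also contains the archive's own header information. A sketch on an invented gzipped FASTA file, assuming a Linux system where ‘zcat’ reads ‘.gz’ files (‘gzip -dc’ is an equivalent):

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf '>seq1\nACGT\n>seq2\nTTGA\n' > example.fasta
gzip example.fasta                  # leaves only example.fasta.gz

# The file stays compressed; only the stream is decompressed.
n_seqs=$(zcat example.fasta.gz | grep -c '^>')
first=$(zcat example.fasta.gz | head -n 1)

echo "$n_seqs"   # 2
echo "$first"    # >seq1
```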