This page shows some basic Linux commands and how they can be used in bioinformatics or for exploring files. The content is based on two courses I took: “Command Line Tools for Genomic Data Science” (Coursera) and “Good Practices in IT for Bioinformatics” (Unicamp).
The command ‘echo’ prints a message on the screen.
echo Hello
## Hello
We can print the current date by using ‘date’.
date
## qua 17 fev 2021 13:34:58 -03
In the system, content is organized in files. Files are stored in directories (folders), which can be located inside other directories. Each level in the path is separated by a slash ‘/’.
We can print the name of our current location with ‘pwd’ (print working directory).
pwd
## /home/natalia/Documentos/Example
Here, we are in the directory ‘home’, sub-directory ‘natalia’ (the username), then sub-directory ‘Documentos’ and finally the current sub-directory ‘Example’.
We can navigate from this location to any other folder in the file system using ‘cd’, which means ‘change directory’. ‘.’ refers to the current directory, so ‘cd .’ stays where we are, and ‘..’ refers to the parent directory, so ‘cd ..’ moves one level up.
cd /home/natalia/
cd .
pwd
cd ..
pwd
cd /home/natalia/Documentos/Example
## /home/natalia
## /home
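As a small sketch of the same idea, the snippet below builds a throwaway directory tree and moves through it with absolute and relative paths. The names ‘project’ and ‘data’ are made up for illustration, and ‘mktemp -d’ is assumed to be available (it is on most Linux systems).

```shell
# Remember where we started and create a scratch area,
# so nothing in the real file system is touched.
start=$PWD
base=$(mktemp -d)
mkdir -p "$base/project/data"

cd "$base/project/data"   # absolute path: starts from the root '/'
cd ..                     # relative path: one level up, now in 'project'
here=$(basename "$PWD")
cd data                   # relative path: down into 'data' again
there=$(basename "$PWD")
echo "$here $there"

cd "$start"               # go back before removing the scratch area
rm -r "$base"
```

An absolute path works from anywhere, while a relative path depends on the current directory, which is why checking ‘pwd’ first is a good habit.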
We can list the files in the current directory using ‘ls’ (list files).
ls
## bear.fasta
## bear.fastq
## bear.sam
## butterfly.fasta
## butterfly.fastq
## butterfly.sam
## CommandLineToolsBioinfo.html
## CommandLineToolsBioinfo.Rmd
## dog.fasta
## dog.fastq
## dog.sam
## fish.fasta
## fish.fastq
## fish.sam
## numbers
## numbers1
## numbers1sort
## numberssort
We can get additional information about the files, such as file type, permissions, owner, size and modification date, using the option ‘-l’.
ls -l
## total 728
## -rw-rw-r-- 1 natalia natalia 2976 fev 8 17:01 bear.fasta
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 bear.fastq
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 bear.sam
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 butterfly.fasta
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 butterfly.fastq
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 butterfly.sam
## -rw-rw-r-- 1 natalia natalia 667104 fev 17 13:34 CommandLineToolsBioinfo.html
## -rw-rw-r-- 1 natalia natalia 9471 fev 17 13:34 CommandLineToolsBioinfo.Rmd
## -rw-rw-r-- 1 natalia natalia 2 fev 17 13:34 dog.fasta
## -rw-rw-r-- 1 natalia natalia 2 fev 17 13:34 dog.fastq
## -rw-rw-r-- 1 natalia natalia 2 fev 17 13:34 dog.sam
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 fish.fasta
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 fish.fastq
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 fish.sam
## -rw-rw-r-- 1 natalia natalia 115 fev 10 17:11 numbers
## -rw-rw-r-- 1 natalia natalia 103 fev 11 14:34 numbers1
## -rw-rw-r-- 1 natalia natalia 103 fev 17 13:34 numbers1sort
## -rw-rw-r-- 1 natalia natalia 115 fev 17 13:34 numberssort
If we use the options ‘-lt’, the files are sorted by modification time, with the most recently modified first.
ls -lt
## total 728
## -rw-rw-r-- 1 natalia natalia 9471 fev 17 13:34 CommandLineToolsBioinfo.Rmd
## -rw-rw-r-- 1 natalia natalia 667104 fev 17 13:34 CommandLineToolsBioinfo.html
## -rw-rw-r-- 1 natalia natalia 103 fev 17 13:34 numbers1sort
## -rw-rw-r-- 1 natalia natalia 115 fev 17 13:34 numberssort
## -rw-rw-r-- 1 natalia natalia 2 fev 17 13:34 dog.fasta
## -rw-rw-r-- 1 natalia natalia 2 fev 17 13:34 dog.fastq
## -rw-rw-r-- 1 natalia natalia 2 fev 17 13:34 dog.sam
## -rw-rw-r-- 1 natalia natalia 103 fev 11 14:34 numbers1
## -rw-rw-r-- 1 natalia natalia 115 fev 10 17:11 numbers
## -rw-rw-r-- 1 natalia natalia 2976 fev 8 17:01 bear.fasta
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 bear.fastq
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 butterfly.fastq
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 fish.fastq
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 butterfly.fasta
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 fish.fasta
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 bear.sam
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 butterfly.sam
## -rw-rw-r-- 1 natalia natalia 2 fev 6 17:09 fish.sam
Sometimes it is useful to find common patterns shared in the names of the files in order to work with a large number of files. For example, we can show the files starting with a specific string using the wildcard ‘*’, which matches any sequence of characters.
ls dog*
## dog.fasta
## dog.fastq
## dog.sam
We can also match a single character from a range using ‘[]’.
ls [a-z]*.fasta
## bear.fasta
## butterfly.fasta
## dog.fasta
## fish.fasta
Or search for files starting with a specific pattern:
ls b*.fasta
## bear.fasta
## butterfly.fasta
Or search for more than one file name:
ls {dog,fish}.sam #without space inside the curly brackets
## dog.sam
## fish.sam
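The patterns above can be tried safely in a scratch directory. The sketch below assumes four empty files created only for this test, and it also uses ‘?’, a wildcard not shown above that matches exactly one character.

```shell
tmp=$(mktemp -d)
touch "$tmp/dog.fasta" "$tmp/dog.sam" "$tmp/fish.fasta" "$tmp/fish.sam"

# '*' matches any run of characters.
all_fasta=$(cd "$tmp" && ls *.fasta | wc -l)
# '?' matches exactly one character: only 'fish' has a four-letter stem.
four_letter=$(cd "$tmp" && ls ????.sam | wc -l)
# '[df]' matches one character out of the listed ones.
d_or_f=$(cd "$tmp" && ls [df]*.sam | wc -l)

echo "fasta files: $all_fasta, four-letter .sam: $four_letter, d/f .sam: $d_or_f"
rm -r "$tmp"
```

The shell expands these patterns before the command runs, so the same wildcards work with ‘cp’, ‘mv’, ‘rm’ and any other command that takes file names.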
To create a directory we use the command ‘mkdir’ (make directory).
mkdir dog
If we want to create a nested path in one step, we add the option ‘-p’, which also creates any missing parent directories.
mkdir -p animals/cat
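To see what ‘-p’ changes, the following sketch tries both forms inside a temporary directory (the directory names are the same toy names used above).

```shell
tmp=$(mktemp -d)

# Without -p this fails, because the parent 'animals' does not exist yet.
if mkdir "$tmp/animals/cat" 2>/dev/null; then plain=ok; else plain=failed; fi

# With -p the missing parent 'animals' is created first.
mkdir -p "$tmp/animals/cat"

# -p also makes the command safe to repeat: no error if the directory exists.
mkdir -p "$tmp/animals/cat" && repeat=ok

[ -d "$tmp/animals/cat" ] && created=yes
echo "$plain $repeat $created"   # prints: failed ok yes
rm -r "$tmp"
```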
‘tree’ can be used to check all the directories and their hierarchy.
tree
## .
## ├── animals
## │ └── cat
## ├── bear.fasta
## ├── bear.fastq
## ├── bear.sam
## ├── butterfly.fasta
## ├── butterfly.fastq
## ├── butterfly.sam
## ├── CommandLineToolsBioinfo.html
## ├── CommandLineToolsBioinfo.Rmd
## ├── dog
## ├── dog.fasta
## ├── dog.fastq
## ├── dog.sam
## ├── fish.fasta
## ├── fish.fastq
## ├── fish.sam
## ├── numbers
## ├── numbers1
## ├── numbers1sort
## └── numberssort
##
## 3 directories, 18 files
We can copy a file using ‘cp’ (copy), giving the name of the file and the destination as arguments.
cp dog.sam dog
ls dog
## dog.sam
Multiple files can be copied at once. Here ‘dog.bam’ does not exist, so ‘cp’ reports an error for it and still copies ‘dog.fasta’:
cp dog.fasta dog.bam dog
ls dog
## cp: cannot stat 'dog.bam': No such file or directory
## dog.fasta
## dog.sam
Or all the files starting with a pattern can be copied:
cp dog.* dog
ls dog
## dog.fasta
## dog.fastq
## dog.sam
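By using ‘mv’ (move), we can move content from the current directory to a target directory. Note that the pattern ‘fish*’ also matches the directory ‘fish’ itself, which is why ‘mv’ reports that it cannot move ‘fish’ into itself; the files are still moved: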
mkdir fish
mv fish.sam fish
mv fish* fish
ls fish
## mv: cannot move 'fish' to a subdirectory of itself, 'fish/fish'
## fish.fasta
## fish.fastq
## fish.sam
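One way to avoid that error message is to make the pattern more specific. This is a minimal sketch in a scratch directory; the file name ‘salmon.fastq’ is invented for the renaming step.

```shell
tmp=$(mktemp -d)
touch "$tmp/fish.fasta" "$tmp/fish.fastq"
mkdir "$tmp/fish"

# 'fish.*' requires a dot after 'fish', so the directory 'fish' is not matched.
mv "$tmp"/fish.* "$tmp/fish"

# When the target is a new file name, 'mv' renames instead of moving.
mv "$tmp/fish/fish.fastq" "$tmp/fish/salmon.fastq"

moved=$(ls "$tmp/fish" | tr '\n' ' ')
echo "$moved"
rm -r "$tmp"
```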
‘rm’ removes files. To avoid mistakes, you can use the option ‘-i’ (or ‘--interactive’), which asks for confirmation before each removal. To remove a directory together with its subdirectories and files, use the option ‘-r’ (recursive); ‘-rf’ does the same without asking for confirmation or complaining about missing files, so use it with care.
rm dog.*
‘rmdir’ can remove empty directories.
rmdir animals/cat
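Since removed files do not go to a trash folder, it can help to practice in a scratch directory first. The sketch below (with a made-up file ‘cat.fasta’) shows that ‘rmdir’ refuses a non-empty directory and that listing a pattern before deleting it is a cheap safety check.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/animals/cat"
touch "$tmp/animals/cat/cat.fasta"

# rmdir refuses to remove a directory that still has content.
if rmdir "$tmp/animals/cat" 2>/dev/null; then first=removed; else first=refused; fi

# Look at what the pattern matches before handing it to rm.
ls "$tmp"/animals/cat/*.fasta
rm "$tmp"/animals/cat/*.fasta

# Now the directory is empty and rmdir succeeds.
rmdir "$tmp/animals/cat" && second=removed
echo "$first $second"
rm -r "$tmp"
```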
‘more’ shows a file one screen at a time and, in its traditional form, only allows moving forward in the file.
#more bear.fasta
‘less’ allows moving both forward and backward in the file. Press ‘q’ to quit.
less bear.fasta
## >seq1
## AGTNCTGCNACGANTNGACTCAGNTCGTNAGNTCACATNGCTANGNGCATCNTGANCGTACGATNNATGCGCNATCATGNNCGATTCANGACTNGGNACTNTACGACGNTCNAGTNGTACAGCNTCGTANGCTNACTGNATANCGNTCAGATNCGCAGTNGCNTANCAGTGATCNGTANCGANTCGCATNGACNTCGNTAACNGTCNTAGTGANCAGNCTGCTANANTGCGANCTTNAGCACNTGATGCNGNCATCTGANCNGATNTGACNTGCAACGTNNCATGGACTNATCGNGTNACGTNCATCGANNGATCATNGCGTCANCNATGTACNGATGNCNTAGCGNCTATANGCGTACNGTCNANACGTTNGACNCTAGCANTGCNGTAANGTCTGNCATNACGNCTGAGCANTGATNCTACGNTNCGATNCAGTGNACCGNATGNTACTCGNAANCGTACTGNTCAGNNACTGANGCTAGTCNNAGCTCANGTATCNGAGNTCNGCTACTNAGNTCGACTAGNCTNGAANTCGTGCANNATCGTCNAGGNATCNGTCAANCTGTGACNTAGNCTCNGANAGTCTNGCAAGCTNTAGCN
## >seq2
## TATNATAGCCCACATTTCCGNAATCCNCATATGCCAACANTTGTGCTNTATAACCNCTAATCAGTCTGATACCCNTAACCNTATTGACCTAACTGNCATTNCTACAGTCACAACCTNAGTTANGCCTACATTTNCACTTACGATACCCGTANATCATAATTNCCGATTCGANCATCACCAGCTNTTACGTCAACANTTATAACTTNGCCCACCTGNTAATTTAANCGTACCATCCGAACTNTATGCANCCTATCTCACATANGTCNCATAGTATCCCGATTCNAATCNCTAAGCTATCAGCNCTTAATAACTGTCATCNACTANACTTGCACACCTATTNGNTGACCCAATTTCTACATANGCGTTCANCCATATCCTCTNGAAACAANGCCTTTACTAAGTATCCNATTCCNAAGTCGNTACTATACCTCTNAAACCGTACCTGCTNAATCACTCGANTTACTCATCTAANGTACTCGATNCAAGCCTTACNATGTTCCANCTAATCGCAATTCNATCTCCGNAAATAGCCCAATTTNCTTNAACGCATTCAGNCACATTGCCCNAATTTATCAANTATGCCNATCCGATATCANCTTCAGTCAGCATCTAACTNTACNTCTAAGCANATTGCACCTCAANCTCTGATTACNATTCGACATACTTGANCCTGCATCACATNACCTNAGCTATTCACTAATGNCATGACCTTACNACCTTANTAGCAACCNCTATGTAAGATTNTCCCTAGCAATCCTNAACTCATNGCTCATTAANTCGCTTACTANCACGTAATACTCCNGTACCTNATGACTAGNATATCCCATGCTNACACTTTCNAATACGCATCGCTACTNATACACGTATCNTATCNGCTACATTTCAAGCANCAGATCCNTTACCNTTCTCAAGANGCACATATCTCATNTGTCACATNCAATACCGTNTTGAACACTCACANTTTACCGTTCCCANATGAACTNTGCTCAATCCCAANATGTCAGTTCCATANTCACNAAGTCTCAGNTTACTCATACACCGNTATATTCNACACGTTCNATACTGACATCANTACTCGCTGACCTTAANTAACCTNGAC
## >seq3
## NTTAGACACNCGATATCATGAATNCCANAAAGCTTCCAGTTANCAGAANTTCCATNACCAATGTAACNCTGATGANCAACTTANCACTGATAAGCACNTATAAGCCNTATACGANCTTCTACGAANNCACAGTTACCANTAATGGCANCATATTCCGANATATAANACCTGTAANGCACTTCTACAGANACNTAAGCTAAGTANTCCTNAACACGTTATNAACGCGTNCAATACATAGCATCNATANGACTCNAAATTCCGNATACCAGTCCTAGTNAAACAATNCGTAGTCATCANGCCTAAANTNGCCATATATACGCNTAAGAATCCANTACAAGNTTCNATCTAGACCTGANATACAGCNATTACTATNGAACCATACTCNAGGATCAANTCNAGAACTCTAAGCTCATNANCTACGTACGATCTAANCTCAANGATACCAGTATNACCATGTNATTACACNGACTAANACGTCNAAAGTCTATANCCGTAATAGCCATNANTACCATGTTCAGCANAACTACATNGTCANCAGATTCNCAAATGATCAATNGCNCTAATGCAATCTACGANATAAGTCCNNCACATGATACCAATTNGANGACTACTCANTGATCACTACAGNATCANGAATTCCANGTACATCACAGNTTAACTNACAGTTACAATGNCCGTTCAANACACAANTTGCTCNATGAATAAACGTCNGTAANCTACCGTAANTCAAANTCGATCATACACTGNNCACATAGTAAGNTATCCCTANACATGGACTAACTNCACATATGNNAATCTGCACNCTAGATAGCTTACAANAACTTCANGTNAGCTAACTATCNCAGAGAACTATNCCGATNTACAAAACTNGCTTCNCGTAAACGANTACATCTAACNTAGNATCAGTACAATATNCCGCTCNAAGTACTAGTCANATACATGACNTATCAANGCACATGTANCCTGCTNAAACCTATAANGCTTAAACNGTTCAACNGAGAACCANTTNAATAGCTCNTTCCGAAAANCTAACGTAGTACNTACATGTACCANTCGTANCAATAGTNCCAATACANTGCAACAACGNTTCNAATAGCTTNAACTACGAAATNGCCTNTTAACGCANAGTACTCATNAAGTACCCGNACTATANGAACTTACCTAACGTNAATNCAGACTAGCTCANTATAAGATCNCTCAGAANTCACNAGTTACCANGATCTAACCTNTAAGTACTGAACNAGTNAATCCCNATC
Note that the FASTA file has a specific organization. Each sequence starts with a header line, which begins with ‘>’ followed by the name of the sequence. The sequence itself follows on the next line(s), in this case a DNA sequence in which the letters A, T, C and G represent the nucleotides and N marks an unknown base.
‘head’ shows the first 10 lines of the file. If we want a different number of lines, we can use ‘-’ followed by the number (or the equivalent ‘-n 5’).
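Because every record has exactly one header line, counting headers counts sequences. The sketch below assumes a tiny made-up file, ‘toy.fasta’, and uses ‘grep’, ‘cut’ and the pipe ‘|’, which are covered further down this page.

```shell
tmp=$(mktemp -d)
printf '>seq1\nACGTN\n>seq2\nTTGCA\n>seq3\nNNACG\n' > "$tmp/toy.fasta"

# '^>' matches lines that start with '>'; -c counts them.
n_seqs=$(grep -c '^>' "$tmp/toy.fasta")

# Header names without the leading '>' (characters from position 2 onward).
names=$(grep '^>' "$tmp/toy.fasta" | cut -c 2- | tr '\n' ' ')

echo "$n_seqs sequences: $names"
rm -r "$tmp"
```

The quotes around ‘^>’ matter: without them the shell would treat ‘>’ as an output redirection.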
head -5 bear.fasta
## >seq1
## AGTNCTGCNACGANTNGACTCAGNTCGTNAGNTCACATNGCTANGNGCATCNTGANCGTACGATNNATGCGCNATCATGNNCGATTCANGACTNGGNACTNTACGACGNTCNAGTNGTACAGCNTCGTANGCTNACTGNATANCGNTCAGATNCGCAGTNGCNTANCAGTGATCNGTANCGANTCGCATNGACNTCGNTAACNGTCNTAGTGANCAGNCTGCTANANTGCGANCTTNAGCACNTGATGCNGNCATCTGANCNGATNTGACNTGCAACGTNNCATGGACTNATCGNGTNACGTNCATCGANNGATCATNGCGTCANCNATGTACNGATGNCNTAGCGNCTATANGCGTACNGTCNANACGTTNGACNCTAGCANTGCNGTAANGTCTGNCATNACGNCTGAGCANTGATNCTACGNTNCGATNCAGTGNACCGNATGNTACTCGNAANCGTACTGNTCAGNNACTGANGCTAGTCNNAGCTCANGTATCNGAGNTCNGCTACTNAGNTCGACTAGNCTNGAANTCGTGCANNATCGTCNAGGNATCNGTCAANCTGTGACNTAGNCTCNGANAGTCTNGCAAGCTNTAGCN
## >seq2
## TATNATAGCCCACATTTCCGNAATCCNCATATGCCAACANTTGTGCTNTATAACCNCTAATCAGTCTGATACCCNTAACCNTATTGACCTAACTGNCATTNCTACAGTCACAACCTNAGTTANGCCTACATTTNCACTTACGATACCCGTANATCATAATTNCCGATTCGANCATCACCAGCTNTTACGTCAACANTTATAACTTNGCCCACCTGNTAATTTAANCGTACCATCCGAACTNTATGCANCCTATCTCACATANGTCNCATAGTATCCCGATTCNAATCNCTAAGCTATCAGCNCTTAATAACTGTCATCNACTANACTTGCACACCTATTNGNTGACCCAATTTCTACATANGCGTTCANCCATATCCTCTNGAAACAANGCCTTTACTAAGTATCCNATTCCNAAGTCGNTACTATACCTCTNAAACCGTACCTGCTNAATCACTCGANTTACTCATCTAANGTACTCGATNCAAGCCTTACNATGTTCCANCTAATCGCAATTCNATCTCCGNAAATAGCCCAATTTNCTTNAACGCATTCAGNCACATTGCCCNAATTTATCAANTATGCCNATCCGATATCANCTTCAGTCAGCATCTAACTNTACNTCTAAGCANATTGCACCTCAANCTCTGATTACNATTCGACATACTTGANCCTGCATCACATNACCTNAGCTATTCACTAATGNCATGACCTTACNACCTTANTAGCAACCNCTATGTAAGATTNTCCCTAGCAATCCTNAACTCATNGCTCATTAANTCGCTTACTANCACGTAATACTCCNGTACCTNATGACTAGNATATCCCATGCTNACACTTTCNAATACGCATCGCTACTNATACACGTATCNTATCNGCTACATTTCAAGCANCAGATCCNTTACCNTTCTCAAGANGCACATATCTCATNTGTCACATNCAATACCGTNTTGAACACTCACANTTTACCGTTCCCANATGAACTNTGCTCAATCCCAANATGTCAGTTCCATANTCACNAAGTCTCAGNTTACTCATACACCGNTATATTCNACACGTTCNATACTGACATCANTACTCGCTGACCTTAANTAACCTNGAC
## >seq3
‘tail’ can be used similarly, but it shows the last lines of the file.
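As an illustration, ‘head’ and ‘tail’ can be tried on a small made-up file (‘lines.txt’ is an assumption of this sketch), and combined to pull out a single line from the middle.

```shell
tmp=$(mktemp -d)
printf 'line1\nline2\nline3\nline4\nline5\n' > "$tmp/lines.txt"

first_two=$(head -n 2 "$tmp/lines.txt" | tr '\n' ' ')
last_two=$(tail -n 2 "$tmp/lines.txt" | tr '\n' ' ')

# Take the first three lines, then the last of those: this is line 3.
third=$(head -n 3 "$tmp/lines.txt" | tail -n 1)

echo "first: $first_two"
echo "last: $last_two"
echo "third: $third"
rm -r "$tmp"
```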
‘cat’ can be used to show the content of one file.
cat bear.fasta
## >seq1
## AGTNCTGCNACGANTNGACTCAGNTCGTNAGNTCACATNGCTANGNGCATCNTGANCGTACGATNNATGCGCNATCATGNNCGATTCANGACTNGGNACTNTACGACGNTCNAGTNGTACAGCNTCGTANGCTNACTGNATANCGNTCAGATNCGCAGTNGCNTANCAGTGATCNGTANCGANTCGCATNGACNTCGNTAACNGTCNTAGTGANCAGNCTGCTANANTGCGANCTTNAGCACNTGATGCNGNCATCTGANCNGATNTGACNTGCAACGTNNCATGGACTNATCGNGTNACGTNCATCGANNGATCATNGCGTCANCNATGTACNGATGNCNTAGCGNCTATANGCGTACNGTCNANACGTTNGACNCTAGCANTGCNGTAANGTCTGNCATNACGNCTGAGCANTGATNCTACGNTNCGATNCAGTGNACCGNATGNTACTCGNAANCGTACTGNTCAGNNACTGANGCTAGTCNNAGCTCANGTATCNGAGNTCNGCTACTNAGNTCGACTAGNCTNGAANTCGTGCANNATCGTCNAGGNATCNGTCAANCTGTGACNTAGNCTCNGANAGTCTNGCAAGCTNTAGCN
## >seq2
## TATNATAGCCCACATTTCCGNAATCCNCATATGCCAACANTTGTGCTNTATAACCNCTAATCAGTCTGATACCCNTAACCNTATTGACCTAACTGNCATTNCTACAGTCACAACCTNAGTTANGCCTACATTTNCACTTACGATACCCGTANATCATAATTNCCGATTCGANCATCACCAGCTNTTACGTCAACANTTATAACTTNGCCCACCTGNTAATTTAANCGTACCATCCGAACTNTATGCANCCTATCTCACATANGTCNCATAGTATCCCGATTCNAATCNCTAAGCTATCAGCNCTTAATAACTGTCATCNACTANACTTGCACACCTATTNGNTGACCCAATTTCTACATANGCGTTCANCCATATCCTCTNGAAACAANGCCTTTACTAAGTATCCNATTCCNAAGTCGNTACTATACCTCTNAAACCGTACCTGCTNAATCACTCGANTTACTCATCTAANGTACTCGATNCAAGCCTTACNATGTTCCANCTAATCGCAATTCNATCTCCGNAAATAGCCCAATTTNCTTNAACGCATTCAGNCACATTGCCCNAATTTATCAANTATGCCNATCCGATATCANCTTCAGTCAGCATCTAACTNTACNTCTAAGCANATTGCACCTCAANCTCTGATTACNATTCGACATACTTGANCCTGCATCACATNACCTNAGCTATTCACTAATGNCATGACCTTACNACCTTANTAGCAACCNCTATGTAAGATTNTCCCTAGCAATCCTNAACTCATNGCTCATTAANTCGCTTACTANCACGTAATACTCCNGTACCTNATGACTAGNATATCCCATGCTNACACTTTCNAATACGCATCGCTACTNATACACGTATCNTATCNGCTACATTTCAAGCANCAGATCCNTTACCNTTCTCAAGANGCACATATCTCATNTGTCACATNCAATACCGTNTTGAACACTCACANTTTACCGTTCCCANATGAACTNTGCTCAATCCCAANATGTCAGTTCCATANTCACNAAGTCTCAGNTTACTCATACACCGNTATATTCNACACGTTCNATACTGACATCANTACTCGCTGACCTTAANTAACCTNGAC
## >seq3
## NTTAGACACNCGATATCATGAATNCCANAAAGCTTCCAGTTANCAGAANTTCCATNACCAATGTAACNCTGATGANCAACTTANCACTGATAAGCACNTATAAGCCNTATACGANCTTCTACGAANNCACAGTTACCANTAATGGCANCATATTCCGANATATAANACCTGTAANGCACTTCTACAGANACNTAAGCTAAGTANTCCTNAACACGTTATNAACGCGTNCAATACATAGCATCNATANGACTCNAAATTCCGNATACCAGTCCTAGTNAAACAATNCGTAGTCATCANGCCTAAANTNGCCATATATACGCNTAAGAATCCANTACAAGNTTCNATCTAGACCTGANATACAGCNATTACTATNGAACCATACTCNAGGATCAANTCNAGAACTCTAAGCTCATNANCTACGTACGATCTAANCTCAANGATACCAGTATNACCATGTNATTACACNGACTAANACGTCNAAAGTCTATANCCGTAATAGCCATNANTACCATGTTCAGCANAACTACATNGTCANCAGATTCNCAAATGATCAATNGCNCTAATGCAATCTACGANATAAGTCCNNCACATGATACCAATTNGANGACTACTCANTGATCACTACAGNATCANGAATTCCANGTACATCACAGNTTAACTNACAGTTACAATGNCCGTTCAANACACAANTTGCTCNATGAATAAACGTCNGTAANCTACCGTAANTCAAANTCGATCATACACTGNNCACATAGTAAGNTATCCCTANACATGGACTAACTNCACATATGNNAATCTGCACNCTAGATAGCTTACAANAACTTCANGTNAGCTAACTATCNCAGAGAACTATNCCGATNTACAAAACTNGCTTCNCGTAAACGANTACATCTAACNTAGNATCAGTACAATATNCCGCTCNAAGTACTAGTCANATACATGACNTATCAANGCACATGTANCCTGCTNAAACCTATAANGCTTAAACNGTTCAACNGAGAACCANTTNAATAGCTCNTTCCGAAAANCTAACGTAGTACNTACATGTACCANTCGTANCAATAGTNCCAATACANTGCAACAACGNTTCNAATAGCTTNAACTACGAAATNGCCTNTTAACGCANAGTACTCATNAAGTACCCGNACTATANGAACTTACCTAACGTNAATNCAGACTAGCTCANTATAAGATCNCTCAGAANTCACNAGTTACCANGATCTAACCTNTAAGTACTGAACNAGTNAATCCCNATC
Or concatenate multiple files:
cat b*.fasta
## >seq1
## AGTNCTGCNACGANTNGACTCAGNTCGTNAGNTCACATNGCTANGNGCATCNTGANCGTACGATNNATGCGCNATCATGNNCGATTCANGACTNGGNACTNTACGACGNTCNAGTNGTACAGCNTCGTANGCTNACTGNATANCGNTCAGATNCGCAGTNGCNTANCAGTGATCNGTANCGANTCGCATNGACNTCGNTAACNGTCNTAGTGANCAGNCTGCTANANTGCGANCTTNAGCACNTGATGCNGNCATCTGANCNGATNTGACNTGCAACGTNNCATGGACTNATCGNGTNACGTNCATCGANNGATCATNGCGTCANCNATGTACNGATGNCNTAGCGNCTATANGCGTACNGTCNANACGTTNGACNCTAGCANTGCNGTAANGTCTGNCATNACGNCTGAGCANTGATNCTACGNTNCGATNCAGTGNACCGNATGNTACTCGNAANCGTACTGNTCAGNNACTGANGCTAGTCNNAGCTCANGTATCNGAGNTCNGCTACTNAGNTCGACTAGNCTNGAANTCGTGCANNATCGTCNAGGNATCNGTCAANCTGTGACNTAGNCTCNGANAGTCTNGCAAGCTNTAGCN
## >seq2
## TATNATAGCCCACATTTCCGNAATCCNCATATGCCAACANTTGTGCTNTATAACCNCTAATCAGTCTGATACCCNTAACCNTATTGACCTAACTGNCATTNCTACAGTCACAACCTNAGTTANGCCTACATTTNCACTTACGATACCCGTANATCATAATTNCCGATTCGANCATCACCAGCTNTTACGTCAACANTTATAACTTNGCCCACCTGNTAATTTAANCGTACCATCCGAACTNTATGCANCCTATCTCACATANGTCNCATAGTATCCCGATTCNAATCNCTAAGCTATCAGCNCTTAATAACTGTCATCNACTANACTTGCACACCTATTNGNTGACCCAATTTCTACATANGCGTTCANCCATATCCTCTNGAAACAANGCCTTTACTAAGTATCCNATTCCNAAGTCGNTACTATACCTCTNAAACCGTACCTGCTNAATCACTCGANTTACTCATCTAANGTACTCGATNCAAGCCTTACNATGTTCCANCTAATCGCAATTCNATCTCCGNAAATAGCCCAATTTNCTTNAACGCATTCAGNCACATTGCCCNAATTTATCAANTATGCCNATCCGATATCANCTTCAGTCAGCATCTAACTNTACNTCTAAGCANATTGCACCTCAANCTCTGATTACNATTCGACATACTTGANCCTGCATCACATNACCTNAGCTATTCACTAATGNCATGACCTTACNACCTTANTAGCAACCNCTATGTAAGATTNTCCCTAGCAATCCTNAACTCATNGCTCATTAANTCGCTTACTANCACGTAATACTCCNGTACCTNATGACTAGNATATCCCATGCTNACACTTTCNAATACGCATCGCTACTNATACACGTATCNTATCNGCTACATTTCAAGCANCAGATCCNTTACCNTTCTCAAGANGCACATATCTCATNTGTCACATNCAATACCGTNTTGAACACTCACANTTTACCGTTCCCANATGAACTNTGCTCAATCCCAANATGTCAGTTCCATANTCACNAAGTCTCAGNTTACTCATACACCGNTATATTCNACACGTTCNATACTGACATCANTACTCGCTGACCTTAANTAACCTNGAC
## >seq3
## NTTAGACACNCGATATCATGAATNCCANAAAGCTTCCAGTTANCAGAANTTCCATNACCAATGTAACNCTGATGANCAACTTANCACTGATAAGCACNTATAAGCCNTATACGANCTTCTACGAANNCACAGTTACCANTAATGGCANCATATTCCGANATATAANACCTGTAANGCACTTCTACAGANACNTAAGCTAAGTANTCCTNAACACGTTATNAACGCGTNCAATACATAGCATCNATANGACTCNAAATTCCGNATACCAGTCCTAGTNAAACAATNCGTAGTCATCANGCCTAAANTNGCCATATATACGCNTAAGAATCCANTACAAGNTTCNATCTAGACCTGANATACAGCNATTACTATNGAACCATACTCNAGGATCAANTCNAGAACTCTAAGCTCATNANCTACGTACGATCTAANCTCAANGATACCAGTATNACCATGTNATTACACNGACTAANACGTCNAAAGTCTATANCCGTAATAGCCATNANTACCATGTTCAGCANAACTACATNGTCANCAGATTCNCAAATGATCAATNGCNCTAATGCAATCTACGANATAAGTCCNNCACATGATACCAATTNGANGACTACTCANTGATCACTACAGNATCANGAATTCCANGTACATCACAGNTTAACTNACAGTTACAATGNCCGTTCAANACACAANTTGCTCNATGAATAAACGTCNGTAANCTACCGTAANTCAAANTCGATCATACACTGNNCACATAGTAAGNTATCCCTANACATGGACTAACTNCACATATGNNAATCTGCACNCTAGATAGCTTACAANAACTTCANGTNAGCTAACTATCNCAGAGAACTATNCCGATNTACAAAACTNGCTTCNCGTAAACGANTACATCTAACNTAGNATCAGTACAATATNCCGCTCNAAGTACTAGTCANATACATGACNTATCAANGCACATGTANCCTGCTNAAACCTATAANGCTTAAACNGTTCAACNGAGAACCANTTNAATAGCTCNTTCCGAAAANCTAACGTAGTACNTACATGTACCANTCGTANCAATAGTNCCAATACANTGCAACAACGNTTCNAATAGCTTNAACTACGAAATNGCCTNTTAACGCANAGTACTCATNAAGTACCCGNACTATANGAACTTACCTAACGTNAATNCAGACTAGCTCANTATAAGATCNCTCAGAANTCACNAGTTACCANGATCTAACCTNTAAGTACTGAACNAGTNAATCCCNATC
##
‘wc’ (word count) gives us three numbers followed by the name of the file. The first is the number of lines in the file, the second is the number of words and the third is the number of bytes (equal to the number of characters in a plain ASCII file such as this one).
wc bear.fasta
## 6 6 2976 bear.fasta
If we use ‘-l’ we obtain only the number of lines in the file. ‘-c’ counts the bytes (‘-m’ counts characters) and ‘-w’ counts the words.
wc -l bear.fasta
wc -c bear.fasta
wc -w bear.fasta
## 6 bear.fasta
## 2976 bear.fasta
## 6 bear.fasta
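In bioinformatics, ‘wc’ is handy for a quick estimate of how many bases a FASTA file holds. This is only a sketch on a made-up two-sequence file: it drops the header lines with ‘grep -v’, deletes the newlines with ‘tr -d’, and counts what remains.

```shell
tmp=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nACGTAC\n' > "$tmp/toy.fasta"

# Remove headers, remove newlines, then count the remaining bytes.
bases=$(grep -v '^>' "$tmp/toy.fasta" | tr -d '\n' | wc -c)

# Reading from standard input makes wc print the number without the file name.
lines=$(wc -l < "$tmp/toy.fasta")

echo "bases: $((bases)) lines: $((lines))"   # prints: bases: 10 lines: 4
rm -r "$tmp"
```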
We can use the symbol ‘<’ to feed a file to a command as input and ‘>’ to save the output of a command in a file. We can also chain commands using the pipe ‘|’, which passes the output of one command to the next.
ls | wc -l #Here we can see how many files we have in this directory
## 15
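The redirection symbols can be sketched together as follows. The file name ‘out.txt’ is made up, and ‘>>’ (append) is an extra operator not mentioned above.

```shell
tmp=$(mktemp -d)

echo "first" > "$tmp/out.txt"      # '>' creates or overwrites the file
echo "second" >> "$tmp/out.txt"    # '>>' appends to the end of it

n_lines=$(wc -l < "$tmp/out.txt")  # '<' feeds the file to wc as standard input

# '|' chains commands: show the file, make it upper case, keep the last line.
upper=$(cat "$tmp/out.txt" | tr 'a-z' 'A-Z' | tail -n 1)

echo "$((n_lines)) $upper"         # prints: 2 SECOND
rm -r "$tmp"
```

Note that ‘>’ silently replaces an existing file, so it is worth double-checking the target name.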
We can use the command ‘cut’ to extract fields (columns) from each line. Fields are separated by tabs by default; we can change the delimiter with ‘-d’ and choose the fields with ‘-f’. For example, we can select columns 1 and 2 of a space-separated file:
cut -d ' ' -f1,2 numbers
## one 1
## two 2
## three 3
## four 4
## five 5
## six 6
## seven 7
## eight 8
## nine 9
## ten 10
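Many bioinformatics formats are tab-separated, so the default delimiter is often what we want. The sketch below uses a made-up table, ‘toy.tsv’, with a chromosome, two coordinates and a gene name per line.

```shell
tmp=$(mktemp -d)
printf 'chr1\t100\t200\tgeneA\nchr2\t300\t400\tgeneB\n' > "$tmp/toy.tsv"

# Tab is the default delimiter, so -d is not needed here.
names=$(cut -f4 "$tmp/toy.tsv" | tr '\n' ' ')

# A range of fields: columns 1 to 3 of the first line.
first_row=$(cut -f1-3 "$tmp/toy.tsv" | head -n 1)

# Another delimiter, here a comma.
second=$(echo "a,b,c" | cut -d ',' -f2)

echo "$names"
echo "$first_row"
echo "$second"
rm -r "$tmp"
```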
‘sort’ puts the lines of the file in alphabetical order.
#more numbers
sort numbers
## eight 8 pair
## five 5 odd
## four 4 pair
## nine 9 odd
## one 1 odd
## seven 7 odd
## six 6 pair
## ten 10 pair
## three 3 odd
## two 2 pair
We can use ‘-r’ to sort in reverse alphabetical order.
sort -r numbers
## two 2 pair
## three 3 odd
## ten 10 pair
## six 6 pair
## seven 7 odd
## one 1 odd
## nine 9 odd
## four 4 pair
## five 5 odd
## eight 8 pair
We can sort by a column by specifying it with ‘-k’. The comparison is still text-based rather than numeric, so 10 does not end up after 9:
sort -k2 numbers
## ten 10 pair
## one 1 odd
## two 2 pair
## three 3 odd
## four 4 pair
## five 5 odd
## six 6 pair
## seven 7 odd
## eight 8 pair
## nine 9 odd
If we want to sort in numerical order, we need to add ‘n’ to the key.
sort -k 2n numbers
## one 1 odd
## two 2 pair
## three 3 odd
## four 4 pair
## five 5 odd
## six 6 pair
## seven 7 odd
## eight 8 pair
## nine 9 odd
## ten 10 pair
Reverse:
sort -k 2nr numbers
## ten 10 pair
## nine 9 odd
## eight 8 pair
## seven 7 odd
## six 6 pair
## five 5 odd
## four 4 pair
## three 3 odd
## two 2 pair
## one 1 odd
Using more than one column:
sort -k 3 -k 2n numbers
## one 1 odd
## three 3 odd
## five 5 odd
## seven 7 odd
## nine 9 odd
## two 2 pair
## four 4 pair
## six 6 pair
## eight 8 pair
## ten 10 pair
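Two details of ‘-k’ are worth a small sketch: a key written as ‘-k2’ runs from field 2 to the end of the line, while ‘-k2,2’ stops at field 2; and ‘-t’ changes the field separator. The data below are made up for the example.

```shell
tmp=$(mktemp -d)
printf 'ten 10 pair\none 1 odd\ntwo 2 pair\n' > "$tmp/toy.txt"

# '-k2,2n' restricts the key to field 2 and compares it as a number.
by_number=$(sort -k2,2n "$tmp/toy.txt" | cut -d ' ' -f1 | tr '\n' ' ')

# '-t' sets the field separator, here a comma.
by_csv=$(printf 'b,2\na,10\nc,1\n' | sort -t ',' -k2,2n | cut -d ',' -f1 | tr '\n' ' ')

echo "$by_number"   # prints: one two ten
echo "$by_csv"      # prints: c b a
rm -r "$tmp"
```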
We can use the option ‘-u’ to keep only the unique values.
cut -d ' ' -f3 numbers | sort -u
## odd
## pair
‘uniq’ can also be used to obtain the unique values, but it only compares adjacent lines. If the input is not sorted, repeated values will still be shown.
cut -d ' ' -f3 numbers | uniq
## odd
## pair
## odd
## pair
## odd
## pair
## odd
## pair
## odd
## pair
In order to get the unique values, we need to sort it first.
cut -d ' ' -f3 numbers | sort | uniq
## odd
## pair
We can, however, use ‘uniq’ with ‘-c’ to count how many consecutive times each value appears.
cut -d ' ' -f3 numbers | uniq -c
## 1 odd
## 1 pair
## 1 odd
## 1 pair
## 1 odd
## 1 pair
## 1 odd
## 1 pair
## 1 odd
## 1 pair
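Sorting before ‘uniq -c’ turns the consecutive counts into total counts per value. A minimal sketch with made-up input, which also sorts the result so the most frequent value comes first:

```shell
# sort groups equal lines together, uniq -c counts each group,
# and 'sort -rn' orders the counts from highest to lowest.
counts=$(printf 'odd\npair\nodd\npair\nodd\n' | sort | uniq -c | sort -rn)
echo "$counts"

# uniq -c pads the count with spaces; sed strips them from the top line.
most_common=$(echo "$counts" | head -n 1 | sed 's/^ *//')
echo "$most_common"   # prints: 3 odd
```

This ‘sort | uniq -c | sort -rn’ chain is a common way to build a quick frequency table from one column of a file.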
We can use ‘grep’ to search for a pattern in files; it prints each line that contains the pattern. It searches the files given as arguments (with ‘-r’ it searches all the files in a directory recursively).
grep odd numbers
## one 1 odd
## three 3 odd
## five 5 odd
## seven 7 odd
## nine 9 odd
If the pattern contains a space, we need to enclose it in quotes. If we want to see the line number, we use ‘-n’.
grep -n "3 odd" numbers
## 3:three 3 odd
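A few other ‘grep’ options are useful in practice. The sketch below assumes a small made-up file and shows ‘-c’ (count matching lines), ‘-v’ (invert the match) and ‘-i’ (ignore case).

```shell
tmp=$(mktemp -d)
printf 'one 1 odd\ntwo 2 pair\nthree 3 odd\nTEN 10 PAIR\n' > "$tmp/toy.txt"

n_odd=$(grep -c 'odd' "$tmp/toy.txt")            # -c counts matching lines
not_odd=$(grep -v 'odd' "$tmp/toy.txt" | wc -l)  # -v keeps lines that do NOT match
any_case=$(grep -ci 'pair' "$tmp/toy.txt")       # -i ignores upper/lower case

echo "$n_odd $((not_odd)) $any_case"             # prints: 2 2 2
rm -r "$tmp"
```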
‘diff’ is used to compare files. It shows 1) the lines that differ in the first file; 2) what has to be done to make the files identical (a - add, c - change and d - delete); 3) the lines that differ in the second file. We can use ‘-i’ to ignore differences between upper and lower case, ‘-s’ to report when the files are identical, and ‘-w’ to ignore tabs and spaces.
diff numbers1 numbers
‘sdiff’ does the same, but the output is formatted side by side in two columns. The option ‘-s’ is used to hide identical lines.
sdiff -s numbers1 numbers
The report shows all the differences between the two files. To make the files identical, we need to insert lines 3 and 4 of the second file at line 3 of the first, and replace lines 7 and 8 with lines 8 and 9.
‘comm’ shows three columns: 1) lines found only in the first file; 2) lines found only in the second file; 3) lines found in both files. The files must be sorted. We can use ‘-1’, ‘-2’ and ‘-3’ to hide the columns we don’t need. In this example, we show only what is common to both files:
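The notation of a ‘diff’ report is easier to read on a very small case. The sketch below uses two made-up files, ‘a.txt’ and ‘b.txt’, that differ by a single line; they are not the ‘numbers’ files used above.

```shell
tmp=$(mktemp -d)
printf 'one\ntwo\nfive\n' > "$tmp/a.txt"
printf 'one\ntwo\nthree\nfive\n' > "$tmp/b.txt"

# diff exits with status 1 when the files differ; '|| true' keeps scripts
# that stop on errors from aborting at this point.
report=$(diff "$tmp/a.txt" "$tmp/b.txt" || true)
echo "$report"
rm -r "$tmp"
```

Here the report is ‘2a3’ followed by ‘> three’: after line 2 of the first file, add line 3 of the second file. Lines starting with ‘>’ come from the second file and lines starting with ‘<’ come from the first.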
sort numbers > numberssort
sort numbers1 > numbers1sort
comm -1 -2 numberssort numbers1sort
## five 5 odd
## one 1 odd
## seven 7 odd
## six 6 pair
## ten 10 pair
## two 2 pair
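The same options can be combined to get the lines exclusive to one file. A minimal sketch with two made-up, already sorted files:

```shell
tmp=$(mktemp -d)
printf 'five\none\nsix\n' > "$tmp/a.sorted"
printf 'one\nsix\nten\n' > "$tmp/b.sorted"

only_a=$(comm -23 "$tmp/a.sorted" "$tmp/b.sorted")   # hide columns 2 and 3
only_b=$(comm -13 "$tmp/a.sorted" "$tmp/b.sorted")   # hide columns 1 and 3
both=$(comm -12 "$tmp/a.sorted" "$tmp/b.sorted" | tr '\n' ' ')

echo "only in a: $only_a"
echo "only in b: $only_b"
echo "in both: $both"
rm -r "$tmp"
```

In bioinformatics this is a quick way to compare two lists of gene or sequence identifiers, as long as both lists are sorted the same way.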
We can use ‘gzip’ to compress a file. ‘bzip2’ usually achieves a higher compression ratio, at the cost of speed.
gzip bear.fasta
‘gunzip’ is used to decompress the file.
gunzip bear.fasta.gz
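If we want to compress several files together, we first need to bundle them with ‘tar’. The option ‘c’ creates an archive that puts the files together, ‘v’ (verbose) lists the files as they are added and ‘f’ indicates the name of the destination archive. The archive can then be compressed, here with ‘bzip2’.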
tar -cvf filescompressed.tar bear.fasta numbers numbers1
bzip2 filescompressed.tar
## bear.fasta
## numbers
## numbers1
To uncompress this file, we first need to use ‘bunzip2’ and then ‘tar’ with the options ‘x’ for extraction, ‘v’ to list the files as they are extracted, and ‘f’ to indicate the archive to read from.
bunzip2 filescompressed.tar.bz2
mkdir filescompressed
mv filescompressed.tar filescompressed
cd filescompressed
tar -xvf filescompressed.tar
## bear.fasta
## numbers
## numbers1
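As an alternative sketch, ‘tar’ can bundle and compress in one step with the option ‘z’ (gzip), and ‘t’ lists an archive without extracting it. The file names below are made up, and the options ‘-z’ and ‘-C’ are assumed to be supported, as they are in GNU tar and bsdtar.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/in" "$tmp/out"
echo "ACGT" > "$tmp/in/a.txt"
echo "TTGA" > "$tmp/in/b.txt"

# -c create, -z compress with gzip, -f archive name;
# -C changes into the given directory before adding the files.
tar -czf "$tmp/files.tar.gz" -C "$tmp/in" a.txt b.txt

# -t lists the archive content without extracting anything.
listing=$(tar -tzf "$tmp/files.tar.gz" | tr '\n' ' ')

# -x extracts; -C chooses where the files go.
tar -xzf "$tmp/files.tar.gz" -C "$tmp/out"
restored=$(cat "$tmp/out/a.txt")

echo "$listing"
echo "$restored"
rm -r "$tmp"
```

Listing with ‘t’ before extracting shows whether the archive will unpack into its own folder or scatter files into the current directory.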
‘zcat’ allows us to look inside a compressed file without decompressing it on disk.
cd filescompressed
gzip filescompressed.tar
zcat filescompressed.tar.gz | head
## bear.fasta
## AGTNCTGCNACGANTNGACTCAGNTCGTNAGNTCACATNGCTANGNGCATCNTGANCGTACGATNNATGCGCNATCATGNNCGATTCANGACTNGGNACTNTACGACGNTCNAGTNGTACAGCNTCGTANGCTNACTGNATANCGNTCAGATNCGCAGTNGCNTANCAGTGATCNGTANCGANTCGCATNGACNTCGNTAACNGTCNTAGTGANCAGNCTGCTANANTGCGANCTTNAGCACNTGATGCNGNCATCTGANCNGATNTGACNTGCAACGTNNCATGGACTNATCGNGTNACGTNCATCGANNGATCATNGCGTCANCNATGTACNGATGNCNTAGCGNCTATANGCGTACNGTCNANACGTTNGACNCTAGCANTGCNGTAANGTCTGNCATNACGNCTGAGCANTGATNCTACGNTNCGATNCAGTGNACCGNATGNTACTCGNAANCGTACTGNTCAGNNACTGANGCTAGTCNNAGCTCANGTATCNGAGNTCNGCTACTNAGNTCGACTAGNCTNGAANTCGTGCANNATCGTCNAGGNATCNGTCAANCTGTGACNTAGNCTCNGANAGTCTNGCAAGCTNTAGCN
## >seq2
## TATNATAGCCCACATTTCCGNAATCCNCATATGCCAACANTTGTGCTNTATAACCNCTAATCAGTCTGATACCCNTAACCNTATTGACCTAACTGNCATTNCTACAGTCACAACCTNAGTTANGCCTACATTTNCACTTACGATACCCGTANATCATAATTNCCGATTCGANCATCACCAGCTNTTACGTCAACANTTATAACTTNGCCCACCTGNTAATTTAANCGTACCATCCGAACTNTATGCANCCTATCTCACATANGTCNCATAGTATCCCGATTCNAATCNCTAAGCTATCAGCNCTTAATAACTGTCATCNACTANACTTGCACACCTATTNGNTGACCCAATTTCTACATANGCGTTCANCCATATCCTCTNGAAACAANGCCTTTACTAAGTATCCNATTCCNAAGTCGNTACTATACCTCTNAAACCGTACCTGCTNAATCACTCGANTTACTCATCTAANGTACTCGATNCAAGCCTTACNATGTTCCANCTAATCGCAATTCNATCTCCGNAAATAGCCCAATTTNCTTNAACGCATTCAGNCACATTGCCCNAATTTATCAANTATGCCNATCCGATATCANCTTCAGTCAGCATCTAACTNTACNTCTAAGCANATTGCACCTCAANCTCTGATTACNATTCGACATACTTGANCCTGCATCACATNACCTNAGCTATTCACTAATGNCATGACCTTACNACCTTANTAGCAACCNCTATGTAAGATTNTCCCTAGCAATCCTNAACTCATNGCTCATTAANTCGCTTACTANCACGTAATACTCCNGTACCTNATGACTAGNATATCCCATGCTNACACTTTCNAATACGCATCGCTACTNATACACGTATCNTATCNGCTACATTTCAAGCANCAGATCCNTTACCNTTCTCAAGANGCACATATCTCATNTGTCACATNCAATACCGTNTTGAACACTCACANTTTACCGTTCCCANATGAACTNTGCTCAATCCCAANATGTCAGTTCCATANTCACNAAGTCTCAGNTTACTCATACACCGNTATATTCNACACGTTCNATACTGACATCANTACTCGCTGACCTTAANTAACCTNGAC
## >seq3
## NTTAGACACNCGATATCATGAATNCCANAAAGCTTCCAGTTANCAGAANTTCCATNACCAATGTAACNCTGATGANCAACTTANCACTGATAAGCACNTATAAGCCNTATACGANCTTCTACGAANNCACAGTTACCANTAATGGCANCATATTCCGANATATAANACCTGTAANGCACTTCTACAGANACNTAAGCTAAGTANTCCTNAACACGTTATNAACGCGTNCAATACATAGCATCNATANGACTCNAAATTCCGNATACCAGTCCTAGTNAAACAATNCGTAGTCATCANGCCTAAANTNGCCATATATACGCNTAAGAATCCANTACAAGNTTCNATCTAGACCTGANATACAGCNATTACTATNGAACCATACTCNAGGATCAANTCNAGAACTCTAAGCTCATNANCTACGTACGATCTAANCTCAANGATACCAGTATNACCATGTNATTACACNGACTAANACGTCNAAAGTCTATANCCGTAATAGCCATNANTACCATGTTCAGCANAACTACATNGTCANCAGATTCNCAAATGATCAATNGCNCTAATGCAATCTACGANATAAGTCCNNCACATGATACCAATTNGANGACTACTCANTGATCACTACAGNATCANGAATTCCANGTACATCACAGNTTAACTNACAGTTACAATGNCCGTTCAANACACAANTTGCTCNATGAATAAACGTCNGTAANCTACCGTAANTCAAANTCGATCATACACTGNNCACATAGTAAGNTATCCCTANACATGGACTAACTNCACATATGNNAATCTGCACNCTAGATAGCTTACAANAACTTCANGTNAGCTAACTATCNCAGAGAACTATNCCGATNTACAAAACTNGCTTCNCGTAAACGANTACATCTAACNTAGNATCAGTACAATATNCCGCTCNAAGTACTAGTCANATACATGACNTATCAANGCACATGTANCCTGCTNAAACCTATAANGCTTAAACNGTTCAACNGAGAACCANTTNAATAGCTCNTTCCGAAAANCTAACGTAGTACNTACATGTACCANTCGTANCAATAGTNCCAATACANTGCAACAACGNTTCNAATAGCTTNAACTACGAAATNGCCTNTTAACGCANAGTACTCATNAAGTACCCGNACTATANGAACTTACCTAACGTNAATNCAGACTAGCTCANTATAAGATCNCTCAGAANTCACNAGTTACCANGATCTAACCTNTAAGTACTGAACNAGTNAATCCCNATC
##
## two 2 pair
## three 3 odd
## four 4 pair
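Reading a compressed file through a pipe is especially useful for large sequence files. The sketch below uses a made-up ‘toy.fasta’ and ‘gzip -dc’, which does the same job as ‘zcat’ and is available wherever ‘gzip’ is.

```shell
tmp=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nTTGA\n' > "$tmp/toy.fasta"

# -c writes the compressed data to standard output, so the original is kept.
gzip -c "$tmp/toy.fasta" > "$tmp/toy.fasta.gz"

# Decompress to standard output only; nothing is written to disk.
first=$(gzip -dc "$tmp/toy.fasta.gz" | head -n 1)
n_seqs=$(gzip -dc "$tmp/toy.fasta.gz" | grep -c '^>')

[ -f "$tmp/toy.fasta" ] && kept=yes
echo "$first $n_seqs $kept"   # prints: >seq1 2 yes
rm -r "$tmp"
```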