Basic Linux Command for Bioinformatics - Beginners - Part 2 - EXAMPLES

Here we present some practical examples of how to explore and make fast consults in biological files using basic linux commands. This is the exercises based on the Coursera Course.

Quick look in the files format

Genes

head apple.genes
## MDP0000303933    MDP0000303933   chr1    -   4276    5447    (4276-4368,4423-4542,4733-4911,5321-5447)
## MDP0000223353    MDP0000223353   chr1    +   77339   79628   (77339-77399,77484-77524,77589-77630,78413-78472,78760-78949,79064-79132,79531-79628)
## MDP0000322928    MDP0000322928   chr1    +   103533  103686  (103533-103686)
## MDP0000151845    MDP0000151845   chr1    -   121369  122541  (121369-122541)
## MDP0000307409    MDP0000307409   chr1    -   123810  125906  (123810-125614,125804-125906)
## MDP0000153869    MDP0000153869   chr1    -   135056  135555  (135056-135105,135183-135555)
## MDP0000187420    MDP0000187420   chr1    +   157313  161160  (157313-157315,157499-157606,158467-158554,159874-159969,160189-160495,160597-160651,161086-161160)
## MDP0000286949    MDP0000286949   chr1    -   161876  162460  (161876-162460)
## MDP0000482754    MDP0000482754   chr1    +   178517  181369  (178517-178579,180750-180772,180866-181033,181258-181369)
## MDP0000726869    MDP0000726869   chr1    +   218660  220437  (218660-218785,219796-219894,219981-220058,220207-220437)

The file apple.genome contains a tag with the name of the chromosome and the sequence in next line.

‘>chr1 ATTTTTTTTCCTTTTGTTTTTGCATACAAGTCAATTTCATTCGAAGTTTGGTTGGAT’

The conditions files are like the genes file, but it contains only the genes in that specific condition.

How many chromosomes are there in the genome?

To answer this, we use the fasta file. A fasta file contains a line started with ‘>’ followed by informations/identifier of the sequence and then, in the next lines, the sequence of nucleotides. In order to know how many chromosomes, we can count the number of ‘>’ in the file.

grep -c ">" apple.genome
grep ">" apple.genome | wc -l
## 3
## 3

How many genes?

In the first column we have the name of the genes. We need to filter the unique values, because one gene can have more than one transcript. Then we count it.

cut -f1 apple.genes | uniq | wc -l
cut -f1 apple.genes | sort -u | wc -l
## 5453
## 5453

Number of variants

The number of variants is in the second column.

cut -f2 apple.genes | sort -u | wc -l
## 5456

How many genes have a single splice variant??

Here, we count the number of times each gene appears in the first column. If it appears more than once, it is because it has more than one variant. We fix the number one to know the exclusive variants.

cut -f1 apple.genes | uniq -c | grep " 1 " | wc -l
cut -f1 apple.genes | uniq -c | grep -c " 1 "
## 5450
## 5450

How may genes have 2 or more splice variants?

We add the parameter ‘-v’ in grep command, so we select all the values different of ’ 1 ’.

cut -f1 apple.genes | uniq -c | grep -v " 1 " | wc -l
## 3

How many genes are there on the ‘+’ strand?

This information is located on the 4th column of the file. We select unique values because it is asked the number of genes.

cut -f1,4 apple.genes | uniq | grep "+" | wc -l
## 2662

How many genes are there on the ‘-’ strand?

cut -f1,4 apple.genes | uniq | grep "-" | wc -l
## 2791

How many genes are there on chromosome chr1?

The information of the chromosome is on the 3rd column.

cut -f1,3 apple.genes | uniq | grep "chr1" | wc -l
## 1624

How many transcripts are there on chr1?

Same than previous question, but as it is asked the number of transcripts, we do not need to use the uniq command.

cut -f1,3 apple.genes | grep "chr1" | wc -l
## 1625

How many genes are in common between condition A and condition B?

We select the first column of the both files and store in new files to make the comparison. We show only the 3rd column of the output because it shows lines present in both files.

cut -f1 apple.conditionA | sort -u > apple.conditionA1
cut -f1 apple.conditionB | sort -u > apple.conditionB1
comm -1 -2 apple.conditionA1 apple.conditionB1 | wc -l
## 2410

How many genes are specific to condition A?

Here we show only the first column of the output. It shows the genes exclusive in the first file.

comm -2 -3 apple.conditionA1 apple.conditionB1 | wc -l
## 1205

How many genes are specific to condition B?

Here we show only the second column of the output. It shows the genes exclusive in the second file.

comm -1 -3 apple.conditionA1 apple.conditionB1 | wc -l
## 1243

How many genes are in common to all three conditions?

To compare the three conditions, first we need to compare a pair of conditions, than select what they have in common and compare to the third condition.

cut -f1 apple.conditionC | sort -u > apple.conditionC1
comm -1 -2 apple.conditionA1 apple.conditionB1 > apple.conditionAB
comm -1 -2 apple.conditionAB apple.conditionC1| wc -l
## 1608