Here we present some practical examples of how to explore and make fast consults in biological files using basic linux commands. This is the exercises based on the Coursera Course.
Genes
head apple.genes
## MDP0000303933 MDP0000303933 chr1 - 4276 5447 (4276-4368,4423-4542,4733-4911,5321-5447)
## MDP0000223353 MDP0000223353 chr1 + 77339 79628 (77339-77399,77484-77524,77589-77630,78413-78472,78760-78949,79064-79132,79531-79628)
## MDP0000322928 MDP0000322928 chr1 + 103533 103686 (103533-103686)
## MDP0000151845 MDP0000151845 chr1 - 121369 122541 (121369-122541)
## MDP0000307409 MDP0000307409 chr1 - 123810 125906 (123810-125614,125804-125906)
## MDP0000153869 MDP0000153869 chr1 - 135056 135555 (135056-135105,135183-135555)
## MDP0000187420 MDP0000187420 chr1 + 157313 161160 (157313-157315,157499-157606,158467-158554,159874-159969,160189-160495,160597-160651,161086-161160)
## MDP0000286949 MDP0000286949 chr1 - 161876 162460 (161876-162460)
## MDP0000482754 MDP0000482754 chr1 + 178517 181369 (178517-178579,180750-180772,180866-181033,181258-181369)
## MDP0000726869 MDP0000726869 chr1 + 218660 220437 (218660-218785,219796-219894,219981-220058,220207-220437)
The file apple.genome contains a tag with the name of the chromosome and the sequence in next line.
‘>chr1 ATTTTTTTTCCTTTTGTTTTTGCATACAAGTCAATTTCATTCGAAGTTTGGTTGGAT’
The conditions files are like the genes file, but it contains only the genes in that specific condition.
To answer this, we use the fasta file. A fasta file contains a line started with ‘>’ followed by informations/identifier of the sequence and then, in the next lines, the sequence of nucleotides. In order to know how many chromosomes, we can count the number of ‘>’ in the file.
grep -c ">" apple.genome
grep ">" apple.genome | wc -l
## 3
## 3
In the first column we have the name of the genes. We need to filter the unique values, because one gene can have more than one transcript. Then we count it.
cut -f1 apple.genes | uniq | wc -l
cut -f1 apple.genes | sort -u | wc -l
## 5453
## 5453
The number of variants is in the second column.
cut -f2 apple.genes | sort -u | wc -l
## 5456
Here, we count the number of times each gene appears in the first column. If it appears more than once, it is because it has more than one variant. We fix the number one to know the exclusive variants.
cut -f1 apple.genes | uniq -c | grep " 1 " | wc -l
cut -f1 apple.genes | uniq -c | grep -c " 1 "
## 5450
## 5450
We add the parameter ‘-v’ in grep command, so we select all the values different of ’ 1 ’.
cut -f1 apple.genes | uniq -c | grep -v " 1 " | wc -l
## 3
This information is located on the 4th column of the file. We select unique values because it is asked the number of genes.
cut -f1,4 apple.genes | uniq | grep "+" | wc -l
## 2662
cut -f1,4 apple.genes | uniq | grep "-" | wc -l
## 2791
The information of the chromosome is on the 3rd column.
cut -f1,3 apple.genes | uniq | grep "chr1" | wc -l
## 1624
Same than previous question, but as it is asked the number of transcripts, we do not need to use the uniq command.
cut -f1,3 apple.genes | grep "chr1" | wc -l
## 1625
We select the first column of the both files and store in new files to make the comparison. We show only the 3rd column of the output because it shows lines present in both files.
cut -f1 apple.conditionA | sort -u > apple.conditionA1
cut -f1 apple.conditionB | sort -u > apple.conditionB1
comm -1 -2 apple.conditionA1 apple.conditionB1 | wc -l
## 2410
Here we show only the first column of the output. It shows the genes exclusive in the first file.
comm -2 -3 apple.conditionA1 apple.conditionB1 | wc -l
## 1205
Here we show only the second column of the output. It shows the genes exclusive in the second file.
comm -1 -3 apple.conditionA1 apple.conditionB1 | wc -l
## 1243
To compare the three conditions, first we need to compare a pair of conditions, than select what they have in common and compare to the third condition.
cut -f1 apple.conditionC | sort -u > apple.conditionC1
comm -1 -2 apple.conditionA1 apple.conditionB1 > apple.conditionAB
comm -1 -2 apple.conditionAB apple.conditionC1| wc -l
## 1608