Working with data means you’d be subsetting it at some point, often more then once. While doing it in R is easier if you have a relatively small dataset, working in bash can dramatically speed up things for bigger stuff and make computation process more stable. Here are few tricks and tips on how to subset your data from command line (works on Macs and -nix systems, Windows command line version would need some adjustment)
list.files()
## [1] "hg19.bed"
## [2] "hg19.sample.tab"
## [3] "hg19.sample.txt"
## [4] "Homo_sapiens.GRCh38.95.chromosome.1.gff3"
## [5] "output1.txt"
## [6] "output2.txt"
## [7] "rsconnect"
## [8] "sample.ann"
## [9] "Subsetting.html"
## [10] "Subsetting.Rmd"
For column selection, the easiest option is to use cut. Not that we use head to limit the output printed out - head -5 makes it 5 lines.
You can specify numbers of particular columns you want with -f followed by a comma-separated list
cat hg19.sample.txt | cut -f1,2,4 | head
## #name chrom txStart
## ENST00000456328.2 chr1 11868
## ENST00000450305.2 chr1 12009
## ENST00000488147.1 chr1 14403
## ENST00000619216.1 chr1 17368
## ENST00000473358.1 chr1 29553
## ENST00000469289.1 chr1 30266
## ENST00000607096.1 chr1 30365
## ENST00000417324.1 chr1 34553
## ENST00000461467.1 chr1 35244
This gives you a beginning of the first, second a nd forth column of the file.
We also use pipe sign (|) to pass data from one command to another, like so: * cat file | head -5 *
Since I mostly do bioinformatics analysis I have selected a derivative of a .gff file as a sample. GFF stands for Genetic Feature Format and contains a specially formatted information about genes and other genome features relative to their position in the sequenced genome. The bulk of a gff file is tab-separated and one of the last columns is comma- and semicolon separated. Mammalian genomes - like human and mouse - will have more than 20,000 annotated genes and every gene will have several entries in the annotation file. You cannot parse this manualy! All the more reasons to learn how to subset this in bash. the sample file from this example can be found here To get the same output check if you have selected GRCh38/hg38, Genes and Gene prediction, knownGene and genome in the checkboxes. I took the first 90 lines from the output. ### Selecting a subset of rows is easy with awk, selecting rows from 10th: awk ’NR >= 10 to 20th && NR <= 50 { print }’
cat hg19.sample.txt | awk 'NR >= 10 && NR <= 20 { print }'
## ENST00000461467.1 chr1 - 35244 36073 35244 35244 2 35244,35720, 35481,36073, uc057aua.1
## ENST00000606857.1 chr1 + 52472 53312 52472 52472 1 52472, 53312, uc286dmx.1
## ENST00000642116.1 chr1 + 57597 64116 57597 57597 3 57597,58699,62915, 57653,58856,64116, uc286dmy.1
## ENST00000492842.2 chr1 + 62948 63887 62948 62948 1 62948, 63887, uc286dmz.1
## ENST00000641515.2 chr1 + 65418 71585 65564 70008 3 65418,65519,69036, 65433,65573,71585, A0A2U3U0J3 uc001aal.2
## ENST00000335137.4 chr1 + 69054 70108 69090 70008 1 69054, 70108, Q8NH21 uc285fxb.1
## ENST00000466430.5 chr1 - 89294 120932 89294 89294 4 89294,92090,112699,120774, 91629,92240,112804,120932, uc057aub.1
## ENST00000495576.1 chr1 - 89550 91105 89550 89550 2 89550,90286, 90050,91105, uc057auc.1
## ENST00000477740.5 chr1 - 92229 129217 92229 92229 4 92229,112699,120720,129054, 92240,112804,120932,129217, uc057aud.1
## ENST00000471248.1 chr1 - 110952 129173 110952 110952 3 110952,112699,129054, 111357,112804,129173, uc057aue.1
## ENST00000610542.1 chr1 - 120724 133723 120724 120724 4 120724,120873,129054,133373, 120869,120932,129223,133723, uc057auf.1
awk 'NR > 59 { exit } NR >= 10 && NR <= 20' hg19.sample.txt
## ENST00000461467.1 chr1 - 35244 36073 35244 35244 2 35244,35720, 35481,36073, uc057aua.1
## ENST00000606857.1 chr1 + 52472 53312 52472 52472 1 52472, 53312, uc286dmx.1
## ENST00000642116.1 chr1 + 57597 64116 57597 57597 3 57597,58699,62915, 57653,58856,64116, uc286dmy.1
## ENST00000492842.2 chr1 + 62948 63887 62948 62948 1 62948, 63887, uc286dmz.1
## ENST00000641515.2 chr1 + 65418 71585 65564 70008 3 65418,65519,69036, 65433,65573,71585, A0A2U3U0J3 uc001aal.2
## ENST00000335137.4 chr1 + 69054 70108 69090 70008 1 69054, 70108, Q8NH21 uc285fxb.1
## ENST00000466430.5 chr1 - 89294 120932 89294 89294 4 89294,92090,112699,120774, 91629,92240,112804,120932, uc057aub.1
## ENST00000495576.1 chr1 - 89550 91105 89550 89550 2 89550,90286, 90050,91105, uc057auc.1
## ENST00000477740.5 chr1 - 92229 129217 92229 92229 4 92229,112699,120720,129054, 92240,112804,120932,129217, uc057aud.1
## ENST00000471248.1 chr1 - 110952 129173 110952 110952 3 110952,112699,129054, 111357,112804,129173, uc057aue.1
## ENST00000610542.1 chr1 - 120724 133723 120724 120724 4 120724,120873,129054,133373, 120869,120932,129223,133723, uc057auf.1
same done with sed:
sed -n 10,20p hg19.sample.txt
## ENST00000461467.1 chr1 - 35244 36073 35244 35244 2 35244,35720, 35481,36073, uc057aua.1
## ENST00000606857.1 chr1 + 52472 53312 52472 52472 1 52472, 53312, uc286dmx.1
## ENST00000642116.1 chr1 + 57597 64116 57597 57597 3 57597,58699,62915, 57653,58856,64116, uc286dmy.1
## ENST00000492842.2 chr1 + 62948 63887 62948 62948 1 62948, 63887, uc286dmz.1
## ENST00000641515.2 chr1 + 65418 71585 65564 70008 3 65418,65519,69036, 65433,65573,71585, A0A2U3U0J3 uc001aal.2
## ENST00000335137.4 chr1 + 69054 70108 69090 70008 1 69054, 70108, Q8NH21 uc285fxb.1
## ENST00000466430.5 chr1 - 89294 120932 89294 89294 4 89294,92090,112699,120774, 91629,92240,112804,120932, uc057aub.1
## ENST00000495576.1 chr1 - 89550 91105 89550 89550 2 89550,90286, 90050,91105, uc057auc.1
## ENST00000477740.5 chr1 - 92229 129217 92229 92229 4 92229,112699,120720,129054, 92240,112804,120932,129217, uc057aud.1
## ENST00000471248.1 chr1 - 110952 129173 110952 110952 3 110952,112699,129054, 111357,112804,129173, uc057aue.1
## ENST00000610542.1 chr1 - 120724 133723 120724 120724 4 120724,120873,129054,133373, 120869,120932,129223,133723, uc057auf.1
of with a combination of tail and head commands: head by default selects first 10 lines of the output.
tail -n +10 hg19.sample.txt | head
## ENST00000461467.1 chr1 - 35244 36073 35244 35244 2 35244,35720, 35481,36073, uc057aua.1
## ENST00000606857.1 chr1 + 52472 53312 52472 52472 1 52472, 53312, uc286dmx.1
## ENST00000642116.1 chr1 + 57597 64116 57597 57597 3 57597,58699,62915, 57653,58856,64116, uc286dmy.1
## ENST00000492842.2 chr1 + 62948 63887 62948 62948 1 62948, 63887, uc286dmz.1
## ENST00000641515.2 chr1 + 65418 71585 65564 70008 3 65418,65519,69036, 65433,65573,71585, A0A2U3U0J3 uc001aal.2
## ENST00000335137.4 chr1 + 69054 70108 69090 70008 1 69054, 70108, Q8NH21 uc285fxb.1
## ENST00000466430.5 chr1 - 89294 120932 89294 89294 4 89294,92090,112699,120774, 91629,92240,112804,120932, uc057aub.1
## ENST00000495576.1 chr1 - 89550 91105 89550 89550 2 89550,90286, 90050,91105, uc057auc.1
## ENST00000477740.5 chr1 - 92229 129217 92229 92229 4 92229,112699,120720,129054, 92240,112804,120932,129217, uc057aud.1
## ENST00000471248.1 chr1 - 110952 129173 110952 110952 3 110952,112699,129054, 111357,112804,129173, uc057aue.1
to print out all the lines, where the position (4th column) of a feature begins at 89550th position:
cat hg19.sample.txt | awk '$4 = 89550' | head -5
## #name chrom strand 89550 txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID
## ENST00000456328.2 chr1 + 89550 14409 11868 11868 3 11868,12612,13220, 12227,12721,14409, uc286dmu.1
## ENST00000450305.2 chr1 + 89550 13670 12009 12009 6 12009,12178,12612,12974,13220,13452, 12057,12227,12697,13052,13374,13670, uc286dmv.1
## ENST00000488147.1 chr1 - 89550 29570 14403 14403 11 14403,15004,15795,16606,16857,17232,17605,17914,18267,24737,29533, 14501,15038,15947,16765,17055,17368,17742,18061,18366,24891,29570, uc286dmw.1
## ENST00000619216.1 chr1 - 89550 17436 17368 17368 1 17368, 17436, uc031tla.1
or find everything that is on the positive strand (3th column):
cat hg19.sample.txt | awk '$3 == "+"' | head -5
## ENST00000456328.2 chr1 + 11868 14409 11868 11868 3 11868,12612,13220, 12227,12721,14409, uc286dmu.1
## ENST00000450305.2 chr1 + 12009 13670 12009 12009 6 12009,12178,12612,12974,13220,13452, 12057,12227,12697,13052,13374,13670, uc286dmv.1
## ENST00000473358.1 chr1 + 29553 31097 29553 29553 3 29553,30563,30975, 30039,30667,31097, uc057aty.1
## ENST00000469289.1 chr1 + 30266 31109 30266 30266 2 30266,30975, 30667,31109, uc057atz.1
## ENST00000607096.1 chr1 + 30365 30503 30365 30365 1 30365, 30503, uc031tlb.1
I took the last column od the annotation file to show how to deal with column separators other than tab, which is taken by default by many commands (for example cut* treats it as a default delimiter)
head sample.ann
## Parent=transcript:ENST00000488147;Name=ENSE00001843071;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001843071;rank=11;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00001935574;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001935574;rank=10;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00002030414;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002030414;rank=9;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00003621279;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003621279;rank=8;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00003553898;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003553898;rank=7;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00003502542;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003502542;rank=6;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00003475637;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003475637;rank=5;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00003565697;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003565697;rank=4;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00003477500;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003477500;rank=3;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00003507205;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003507205;rank=2;version=1
head showed us the first 10 lines of the file, now we want to select some entry types from the annotations and skip the others. Problem is these types are separated by a semicolumn instead of being tabulated.
cut selects columns from the input file. -d specifies the separator (and we select “;”), -f tells the numbers of columns to be selected (in this case, second, third and eights). > redirects output into a new file, “output1.txt”
cut -d ';' -f 2,3,8 sample.ann > output1.txt
Let’s chech what we’ve got:
head output1.txt
## Name=ENSE00001843071;constitutive=1;version=1
## Name=ENSE00001935574;constitutive=1;version=1
## Name=ENSE00002030414;constitutive=1;version=1
## Name=ENSE00003621279;constitutive=1;version=1
## Name=ENSE00003553898;constitutive=1;version=1
## Name=ENSE00003502542;constitutive=1;version=1
## Name=ENSE00003475637;constitutive=1;version=1
## Name=ENSE00003565697;constitutive=1;version=1
## Name=ENSE00003477500;constitutive=1;version=1
## Name=ENSE00003507205;constitutive=1;version=1
Same can be achieved with substituting “;” for tab and using the default run of cut:
cat sample.ann | tr ';' '\t' | cut -f 2,3,8 > output2.txt
Sometimes we need to check last few rows of the file rather then the first few, tail to the resque:
tail -n 4 extracts the last 4 lines
tail -n +2 extracts from the second row onward, allows us to skip the first row:
cut -d ';' -f 2,3,8 sample.ann | tail -n +2 | head
You can also replace white spaces with tab, using tr:
tr ' ' \\t < hg19.sample.txt > hg19.sample.tab
or do same thing with sed:
sed 's/[:blank:]+/,/g' thefile.txt > the_modified_copy.txt
actualy, awk can also take specifiction on what the separation should be:
awk -v OFS="\t" '$1=$1' file1