Subsetting data in bash

Working with data means you’d be subsetting it at some point, often more then once. While doing it in R is easier if you have a relatively small dataset, working in bash can dramatically speed up things for bigger stuff and make computation process more stable. Here are few tricks and tips on how to subset your data from command line (works on Macs and -nix systems, Windows command line version would need some adjustment)

list.files()

##  [1] "hg19.bed"                                
##  [2] "hg19.sample.tab"                         
##  [3] "hg19.sample.txt"                         
##  [4] "Homo_sapiens.GRCh38.95.chromosome.1.gff3"
##  [5] "output1.txt"                             
##  [6] "output2.txt"                             
##  [7] "rsconnect"                               
##  [8] "sample.ann"                              
##  [9] "Subsetting.html"                         
## [10] "Subsetting.Rmd"

Working with tabulated data

Selecting columns

For column selection, the easiest option is to use cut. Not that we use head to limit the output printed out - head -5 makes it 5 lines.

You can specify numbers of particular columns you want with -f followed by a comma-separated list

cat hg19.sample.txt | cut -f1,2,4 | head

## #name    chrom   txStart
## ENST00000456328.2    chr1    11868
## ENST00000450305.2    chr1    12009
## ENST00000488147.1    chr1    14403
## ENST00000619216.1    chr1    17368
## ENST00000473358.1    chr1    29553
## ENST00000469289.1    chr1    30266
## ENST00000607096.1    chr1    30365
## ENST00000417324.1    chr1    34553
## ENST00000461467.1    chr1    35244

This gives you a beginning of the first, second a nd forth column of the file.

We also use pipe sign (|) to pass data from one command to another, like so: * cat file | head -5 *

files

Since I mostly do bioinformatics analysis I have selected a derivative of a .gff file as a sample. GFF stands for Genetic Feature Format and contains a specially formatted information about genes and other genome features relative to their position in the sequenced genome. The bulk of a gff file is tab-separated and one of the last columns is comma- and semicolon separated. Mammalian genomes - like human and mouse - will have more than 20,000 annotated genes and every gene will have several entries in the annotation file. You cannot parse this manualy! All the more reasons to learn how to subset this in bash. the sample file from this example can be found here To get the same output check if you have selected GRCh38/hg38, Genes and Gene prediction, knownGene and genome in the checkboxes. I took the first 90 lines from the output. ### Selecting a subset of rows is easy with awk, selecting rows from 10th: awk ’NR >= 10 to 20th && NR <= 50 { print }’

cat hg19.sample.txt | awk 'NR >= 10  && NR <= 20 { print }'

## ENST00000461467.1    chr1    -   35244   36073   35244   35244   2   35244,35720,    35481,36073,        uc057aua.1
## ENST00000606857.1    chr1    +   52472   53312   52472   52472   1   52472,  53312,      uc286dmx.1
## ENST00000642116.1    chr1    +   57597   64116   57597   57597   3   57597,58699,62915,  57653,58856,64116,      uc286dmy.1
## ENST00000492842.2    chr1    +   62948   63887   62948   62948   1   62948,  63887,      uc286dmz.1
## ENST00000641515.2    chr1    +   65418   71585   65564   70008   3   65418,65519,69036,  65433,65573,71585,  A0A2U3U0J3  uc001aal.2
## ENST00000335137.4    chr1    +   69054   70108   69090   70008   1   69054,  70108,  Q8NH21  uc285fxb.1
## ENST00000466430.5    chr1    -   89294   120932  89294   89294   4   89294,92090,112699,120774,  91629,92240,112804,120932,      uc057aub.1
## ENST00000495576.1    chr1    -   89550   91105   89550   89550   2   89550,90286,    90050,91105,        uc057auc.1
## ENST00000477740.5    chr1    -   92229   129217  92229   92229   4   92229,112699,120720,129054, 92240,112804,120932,129217,     uc057aud.1
## ENST00000471248.1    chr1    -   110952  129173  110952  110952  3   110952,112699,129054,   111357,112804,129173,       uc057aue.1
## ENST00000610542.1    chr1    -   120724  133723  120724  120724  4   120724,120873,129054,133373,    120869,120932,129223,133723,        uc057auf.1

awk 'NR > 59 { exit } NR >= 10 && NR <= 20' hg19.sample.txt

## ENST00000461467.1    chr1    -   35244   36073   35244   35244   2   35244,35720,    35481,36073,        uc057aua.1
## ENST00000606857.1    chr1    +   52472   53312   52472   52472   1   52472,  53312,      uc286dmx.1
## ENST00000642116.1    chr1    +   57597   64116   57597   57597   3   57597,58699,62915,  57653,58856,64116,      uc286dmy.1
## ENST00000492842.2    chr1    +   62948   63887   62948   62948   1   62948,  63887,      uc286dmz.1
## ENST00000641515.2    chr1    +   65418   71585   65564   70008   3   65418,65519,69036,  65433,65573,71585,  A0A2U3U0J3  uc001aal.2
## ENST00000335137.4    chr1    +   69054   70108   69090   70008   1   69054,  70108,  Q8NH21  uc285fxb.1
## ENST00000466430.5    chr1    -   89294   120932  89294   89294   4   89294,92090,112699,120774,  91629,92240,112804,120932,      uc057aub.1
## ENST00000495576.1    chr1    -   89550   91105   89550   89550   2   89550,90286,    90050,91105,        uc057auc.1
## ENST00000477740.5    chr1    -   92229   129217  92229   92229   4   92229,112699,120720,129054, 92240,112804,120932,129217,     uc057aud.1
## ENST00000471248.1    chr1    -   110952  129173  110952  110952  3   110952,112699,129054,   111357,112804,129173,       uc057aue.1
## ENST00000610542.1    chr1    -   120724  133723  120724  120724  4   120724,120873,129054,133373,    120869,120932,129223,133723,        uc057auf.1

same done with sed:

sed -n 10,20p hg19.sample.txt

## ENST00000461467.1    chr1    -   35244   36073   35244   35244   2   35244,35720,    35481,36073,        uc057aua.1
## ENST00000606857.1    chr1    +   52472   53312   52472   52472   1   52472,  53312,      uc286dmx.1
## ENST00000642116.1    chr1    +   57597   64116   57597   57597   3   57597,58699,62915,  57653,58856,64116,      uc286dmy.1
## ENST00000492842.2    chr1    +   62948   63887   62948   62948   1   62948,  63887,      uc286dmz.1
## ENST00000641515.2    chr1    +   65418   71585   65564   70008   3   65418,65519,69036,  65433,65573,71585,  A0A2U3U0J3  uc001aal.2
## ENST00000335137.4    chr1    +   69054   70108   69090   70008   1   69054,  70108,  Q8NH21  uc285fxb.1
## ENST00000466430.5    chr1    -   89294   120932  89294   89294   4   89294,92090,112699,120774,  91629,92240,112804,120932,      uc057aub.1
## ENST00000495576.1    chr1    -   89550   91105   89550   89550   2   89550,90286,    90050,91105,        uc057auc.1
## ENST00000477740.5    chr1    -   92229   129217  92229   92229   4   92229,112699,120720,129054, 92240,112804,120932,129217,     uc057aud.1
## ENST00000471248.1    chr1    -   110952  129173  110952  110952  3   110952,112699,129054,   111357,112804,129173,       uc057aue.1
## ENST00000610542.1    chr1    -   120724  133723  120724  120724  4   120724,120873,129054,133373,    120869,120932,129223,133723,        uc057auf.1

of with a combination of tail and head commands: head by default selects first 10 lines of the output.

tail -n +10 hg19.sample.txt | head

## ENST00000461467.1    chr1    -   35244   36073   35244   35244   2   35244,35720,    35481,36073,        uc057aua.1
## ENST00000606857.1    chr1    +   52472   53312   52472   52472   1   52472,  53312,      uc286dmx.1
## ENST00000642116.1    chr1    +   57597   64116   57597   57597   3   57597,58699,62915,  57653,58856,64116,      uc286dmy.1
## ENST00000492842.2    chr1    +   62948   63887   62948   62948   1   62948,  63887,      uc286dmz.1
## ENST00000641515.2    chr1    +   65418   71585   65564   70008   3   65418,65519,69036,  65433,65573,71585,  A0A2U3U0J3  uc001aal.2
## ENST00000335137.4    chr1    +   69054   70108   69090   70008   1   69054,  70108,  Q8NH21  uc285fxb.1
## ENST00000466430.5    chr1    -   89294   120932  89294   89294   4   89294,92090,112699,120774,  91629,92240,112804,120932,      uc057aub.1
## ENST00000495576.1    chr1    -   89550   91105   89550   89550   2   89550,90286,    90050,91105,        uc057auc.1
## ENST00000477740.5    chr1    -   92229   129217  92229   92229   4   92229,112699,120720,129054, 92240,112804,120932,129217,     uc057aud.1
## ENST00000471248.1    chr1    -   110952  129173  110952  110952  3   110952,112699,129054,   111357,112804,129173,       uc057aue.1

Conditional subsetting with awk

to print out all the lines, where the position (4th column) of a feature begins at 89550th position:

cat hg19.sample.txt | awk '$4 = 89550' | head -5

## #name chrom strand 89550 txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID
## ENST00000456328.2 chr1 + 89550 14409 11868 11868 3 11868,12612,13220, 12227,12721,14409, uc286dmu.1
## ENST00000450305.2 chr1 + 89550 13670 12009 12009 6 12009,12178,12612,12974,13220,13452, 12057,12227,12697,13052,13374,13670, uc286dmv.1
## ENST00000488147.1 chr1 - 89550 29570 14403 14403 11 14403,15004,15795,16606,16857,17232,17605,17914,18267,24737,29533, 14501,15038,15947,16765,17055,17368,17742,18061,18366,24891,29570, uc286dmw.1
## ENST00000619216.1 chr1 - 89550 17436 17368 17368 1 17368, 17436, uc031tla.1

or find everything that is on the positive strand (3th column):

cat hg19.sample.txt | awk '$3 == "+"' | head -5

## ENST00000456328.2    chr1    +   11868   14409   11868   11868   3   11868,12612,13220,  12227,12721,14409,      uc286dmu.1
## ENST00000450305.2    chr1    +   12009   13670   12009   12009   6   12009,12178,12612,12974,13220,13452,    12057,12227,12697,13052,13374,13670,        uc286dmv.1
## ENST00000473358.1    chr1    +   29553   31097   29553   29553   3   29553,30563,30975,  30039,30667,31097,      uc057aty.1
## ENST00000469289.1    chr1    +   30266   31109   30266   30266   2   30266,30975,    30667,31109,        uc057atz.1
## ENST00000607096.1    chr1    +   30365   30503   30365   30365   1   30365,  30503,      uc031tlb.1

Other formats

I took the last column od the annotation file to show how to deal with column separators other than tab, which is taken by default by many commands (for example cut* treats it as a default delimiter)

head sample.ann

## Parent=transcript:ENST00000488147;Name=ENSE00001843071;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001843071;rank=11;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00001935574;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001935574;rank=10;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00002030414;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002030414;rank=9;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00003621279;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003621279;rank=8;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00003553898;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003553898;rank=7;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00003502542;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003502542;rank=6;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00003475637;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003475637;rank=5;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00003565697;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003565697;rank=4;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00003477500;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003477500;rank=3;version=1
## Parent=transcript:ENST00000488147;Name=ENSE00003507205;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003507205;rank=2;version=1

head showed us the first 10 lines of the file, now we want to select some entry types from the annotations and skip the others. Problem is these types are separated by a semicolumn instead of being tabulated.

cut selects columns from the input file. -d specifies the separator (and we select “;”), -f tells the numbers of columns to be selected (in this case, second, third and eights). > redirects output into a new file, “output1.txt”

cut -d ';' -f 2,3,8 sample.ann > output1.txt

Let’s chech what we’ve got:

head output1.txt

## Name=ENSE00001843071;constitutive=1;version=1
## Name=ENSE00001935574;constitutive=1;version=1
## Name=ENSE00002030414;constitutive=1;version=1
## Name=ENSE00003621279;constitutive=1;version=1
## Name=ENSE00003553898;constitutive=1;version=1
## Name=ENSE00003502542;constitutive=1;version=1
## Name=ENSE00003475637;constitutive=1;version=1
## Name=ENSE00003565697;constitutive=1;version=1
## Name=ENSE00003477500;constitutive=1;version=1
## Name=ENSE00003507205;constitutive=1;version=1

Same can be achieved with substituting “;” for tab and using the default run of cut:

cat sample.ann | tr ';' '\t' | cut -f 2,3,8  > output2.txt

Sometimes we need to check last few rows of the file rather then the first few, tail to the resque:

tail -n 4 extracts the last 4 lines

tail -n +2 extracts from the second row onward, allows us to skip the first row:

cut -d ';' -f 2,3,8 sample.ann | tail -n +2 | head

You can also replace white spaces with tab, using tr:

tr ' ' \\t < hg19.sample.txt > hg19.sample.tab

or do same thing with sed:

sed 's/[:blank:]+/,/g' thefile.txt > the_modified_copy.txt

actualy, awk can also take specifiction on what the separation should be:

awk -v OFS="\t" '$1=$1' file1