Tokenization for Biological Sequences
DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome (Ji et al. 2021)
They developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts.
Tokenization
k-mer representation (widely used in analyzing DNA sequences)
The k-mer representation incorporates contextual information for each deoxynucleotide base by concatenating it with the k−1 bases that follow it; the resulting length-k substring is called a k-mer.
For example, a DNA sequence ‘ATGGCT’ can be tokenized:
four 3-mers: {ATG, TGG, GGC, GCT}
two 5-mers: {ATGGC, TGGCT}
Different k leads to different tokenization of a DNA sequence.
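As a minimal sketch (the helper name `kmer_tokenize` and the printed outputs are illustrative, not from the paper), overlapping k-mer tokenization is simply a sliding window with stride 1:

```python
def kmer_tokenize(sequence: str, k: int) -> list[str]:
    """Slide a window of length k over the sequence with stride 1 (overlapping k-mers)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmer_tokenize("ATGGCT", 3))  # ['ATG', 'TGG', 'GGC', 'GCT']
print(kmer_tokenize("ATGGCT", 5))  # ['ATGGC', 'TGGCT']
```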
DNABERT-3, DNABERT-4, DNABERT-5, DNABERT-6
For DNABERT-k, the vocabulary consists of all possible k-mers over the four bases as well as 5 special tokens:
[CLS] stands for classification token
[PAD] stands for padding token
[UNK] stands for unknown token
[SEP] stands for separation token
[MASK] stands for masked token
Thus, there are \(4^k+5\) tokens in the vocabulary of DNABERT-k.
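For example, DNABERT-6 has \(4^6 + 5 = 4101\) tokens. A small sketch of how such a vocabulary could be enumerated (the token ordering and function name are assumptions, not taken from the released model):

```python
from itertools import product

SPECIAL_TOKENS = ["[CLS]", "[PAD]", "[UNK]", "[SEP]", "[MASK]"]

def build_vocab(k: int) -> list[str]:
    """All 4^k k-mers over {A, C, G, T} plus the 5 special tokens."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return SPECIAL_TOKENS + kmers

for k in (3, 4, 5, 6):
    print(k, len(build_vocab(k)))  # 69, 261, 1029, 4101 tokens, i.e. 4^k + 5
```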
DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome (Zhou et al. 2023)
One limitation of DNABERT: k-mer tokenization results in information leakage and overall poor computational efficiency during pre-training.
- DNABERT-2 replaces k-mer tokenization with Byte Pair Encoding (BPE).
- DNABERT-2 also replaces learned positional embeddings with Attention with Linear Biases (ALiBi) [Press et al., 2021], removing DNABERT's input length limitation.
During tokenization, a sliding window with window size k and stride t is employed to convert the original genome sequence into a series of k-mers.
Here, the stride t is set to either 1 or k: t = 1 gives the overlapping version of k-mer tokenization and t = k gives the non-overlapping version.
Both versions are suboptimal.
Overlapping:
The tokenized sequence for an input of length L consists of \(L - k + 1\) tokens, each of length k. The tokens are highly redundant and the tokenized sequence is nearly as long as the original one, leading to low computational efficiency.
Non-overlapping:
Despite its advantage of reducing the sequence length by a factor of k, it suffers from a notable sample-inefficiency issue.
A minor shift of the input sequence (e.g., by a single nucleotide) drastically alters the tokenized output, making it difficult for the model to align the distinct representations of identical or near-identical inputs.
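Both behaviors can be seen with a small sketch (the example sequence and helper are illustrative only): with stride k, dropping a single leading base changes almost every token, whereas with stride 1 the output is barely shorter than the input.

```python
def kmer_tokenize(sequence: str, k: int, stride: int) -> list[str]:
    """Sliding-window k-mers; stride=1 -> overlapping, stride=k -> non-overlapping."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

seq = "ATGGCTACGTAG"
shifted = seq[1:]  # shift the input by a single nucleotide

# Non-overlapping (stride = k): the tokenization changes almost completely.
print(kmer_tokenize(seq, 4, 4))      # ['ATGG', 'CTAC', 'GTAG']
print(kmer_tokenize(shifted, 4, 4))  # ['TGGC', 'TACG']

# Overlapping (stride = 1): 9 highly redundant tokens for a 12-base input.
print(kmer_tokenize(seq, 4, 1))
```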
Subword tokenization framework
DNABERT-2 adapts SentencePiece [Kudo and Richardson, 2018] with Byte Pair Encoding (BPE) [Sennrich et al., 2016] to perform tokenization for DNA sequences.
It learns a fixed-size vocabulary of variable-length tokens based on the co-occurrence frequency of the characters.
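A toy illustration of the BPE idea (this is not the actual SentencePiece implementation used by DNABERT-2; the corpus and helper are made up): starting from single nucleotides, repeatedly merge the most frequent adjacent symbol pair into a new, longer token until the desired number of merges is reached.

```python
from collections import Counter

def learn_bpe_merges(sequences: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    corpus = [list(seq) for seq in sequences]  # begin with single-nucleotide symbols
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]   # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        for symbols in corpus:                    # apply the merge everywhere
            i = 0
            while i < len(symbols) - 1:
                if (symbols[i], symbols[i + 1]) == best:
                    symbols[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

print(learn_bpe_merges(["ATGGCTATGG", "ATGGATGG"], num_merges=3))
```

The learned merges define variable-length tokens whose lengths reflect how often sub-sequences co-occur in the corpus, which is the property the DNABERT-2 vocabulary is built on.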
Positional embedding:
Existing solutions such as Sinusoidal [Vaswani et al., 2017], learned [Kenton and Toutanova, 2019], and Rotary [Su et al., 2021] positional embedding methods either impose an input length restriction or extrapolate poorly to sequences longer than those seen during training. Attention with Linear Biases (ALiBi) provides an efficient yet effective solution. Instead of adding position embeddings to the input, ALiBi adds a fixed set of static, non-learned biases to each attention calculation to incorporate positional information into the attention scores.
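As a sketch of the idea (a symmetric, bidirectional variant; the exact DNABERT-2 implementation may differ), the bias for each head is a head-specific slope times the distance between the query and key positions, subtracted from the attention scores:

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Static, non-learned ALiBi biases of shape (num_heads, seq_len, seq_len)."""
    # Geometric sequence of slopes from the ALiBi paper (assumes num_heads is a power of 2).
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = np.arange(seq_len)
    distance = np.abs(positions[None, :] - positions[:, None])   # |i - j|
    return -slopes[:, None, None] * distance[None, :, :]

# Usage: for each head h, scores[h] = q[h] @ k[h].T / sqrt(d) + alibi_bias(seq_len, num_heads)[h]
print(alibi_bias(seq_len=4, num_heads=2)[0])
```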
Nucleotide Transformer (Dalla-Torre et al. 2023)
They used 6-mer tokens as a trade-off between sequence length (up to 6 kb) and embedding size, and because this token length achieved the highest performance compared with other token lengths.
Version 2: Instead of learned positional embeddings, they use Rotary Embeddings, applied at each attention layer.
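A minimal sketch of the rotary idea (the "rotate-half" formulation common in open-source implementations, not necessarily the exact Nucleotide Transformer code): each pair of query/key features is rotated by an angle proportional to the token position, so relative positions show up in the attention dot products.

```python
import numpy as np

def rotary_embed(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to a (seq_len, dim) array of queries or keys."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-2.0 * np.arange(half) / dim)             # per-pair rotation frequencies
    angles = np.arange(seq_len)[:, None] * freqs[None, :]      # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                          # split features into two halves
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Applied to queries and keys at every attention layer, e.g.:
# q, k = rotary_embed(q), rotary_embed(k); scores = q @ k.T / np.sqrt(dim)
```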