The large number of tokens and the size of the vocabulary for unigrams, bigrams and trigrams make it challenging to use this data in a mobile or web-based data product. A decision has to be made about how to reduce the set of n-grams that is used.
| ng | noOfEntries_V | N | memorySize |
|---|---|---|---|
| 1 | 362734 | 56859890 | 25.1 Mb |
| 2 | 6905615 | 53549688 | 520 Mb |
| 3 | 20496858 | 50239484 | 1.6 Gb |
Create a data structure with the fields bigram(w_i-2, w_i-1), next.word(w_i), count.trigram(w_i-2, w_i-1, w_i) and count.bigram(w_i-2, w_i-1), and keep only as many trigrams as are needed to reach a given coverage.
With 30% coverage, this keeps 229289 of the 20496858 observed trigrams.
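The coverage-based cut can be sketched as follows. This is a minimal illustration, not the code used for the report: it assumes that coverage means the share of the total count N accounted for by the retained n-grams, that the most frequent n-grams are kept first, and that the cut happens as soon as the running total reaches the target. The function names and the toy counts are hypothetical.

```python
def truncate_by_coverage(counts, coverage):
    """Keep the most frequent n-grams until their cumulative count
    reaches `coverage` (a fraction between 0 and 1) of the total count."""
    total = sum(counts.values())
    target = coverage * total
    # Most frequent first; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = {}
    running = 0
    for ngram, count in ranked:
        if running >= target:
            break
        kept[ngram] = count
        running += count
    return kept


def build_trigram_table(trigram_counts, bigram_counts, coverage):
    """One row per retained trigram, with the fields named in the text."""
    kept = truncate_by_coverage(trigram_counts, coverage)
    return [
        {
            "bigram": (w1, w2),
            "next.word": w3,
            "count.trigram": count,
            "count.bigram": bigram_counts[(w1, w2)],
        }
        for (w1, w2, w3), count in kept.items()
    ]


# Toy counts (made up for illustration): 8 trigram occurrences in total.
trigram_counts = {
    ("i", "am", "a"): 4,
    ("i", "am", "the"): 2,
    ("you", "are", "a"): 1,
    ("we", "are", "the"): 1,
}
bigram_counts = {("i", "am"): 7, ("you", "are"): 1, ("we", "are"): 1}

rows = build_trigram_table(trigram_counts, bigram_counts, coverage=0.75)
for row in rows:
    print(row)
```

With these toy counts, a coverage of 0.75 keeps the two most frequent trigrams (6 of 8 occurrences). Storing count.bigram next to count.trigram lets the ratio count.trigram / count.bigram be computed from a single row at prediction time.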
Limit the size of the bigram model so that it covers 75% of the bigram occurrences (N).
Limit the size of the unigram model to 95% coverage. The left-out unigrams are replaced by a single “OTH” unigram whose count equals the total count of the left-out unigrams (N does not change, V does). When a unigram is looked up and not found, “OTH” is used instead.
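A possible way to implement the “OTH” replacement is sketched below. It is an illustration under assumptions rather than the report's actual code: the most frequent unigrams are kept until the coverage target is reached, and the names `collapse_rare_unigrams` and `unigram_count`, as well as the toy counts, are hypothetical.

```python
OTH = "OTH"  # placeholder unigram that stands in for all left-out words


def collapse_rare_unigrams(counts, coverage):
    """Keep the most frequent unigrams up to `coverage` of the total count
    and fold the remaining ones into a single OTH entry."""
    total = sum(counts.values())
    target = coverage * total
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    model = {}
    running = 0
    for word, count in ranked:
        if running >= target:
            break
        model[word] = count
        running += count
    left_out = total - running
    if left_out > 0:
        # OTH carries the count of everything that was dropped, so N is unchanged.
        model[OTH] = left_out
    return model


def unigram_count(model, word):
    """Look up a unigram, falling back to OTH when the word is unknown."""
    return model.get(word, model.get(OTH, 0))


# Toy counts (made up for illustration): N = 20.
counts = {"the": 10, "of": 6, "cat": 3, "zyzzyva": 1}
model = collapse_rare_unigrams(counts, coverage=0.95)
print(model)
print(unigram_count(model, "zyzzyva"))
```

In this toy example the vocabulary shrinks from four entries to three words plus “OTH”, while the sum of the counts stays at 20, which mirrors the statement that N does not change and V does.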
After these reductions, the models have the following sizes (compare with the table of the full models above):
| ng | noOfEntries_V | N | memorySize |
|---|---|---|---|
| 1 | 9132 | 56859890 | 1.1 Mb |
| 2 | 368697 | 40162258 | 26.8 Mb |
| 3 | 229289 | 15071838 | 10.2 Mb |
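The figures in the two tables can be cross-checked with a few lines of arithmetic. The snippet below only uses the noOfEntries_V and N values reported above; the variable names are hypothetical.

```python
# (noOfEntries_V, N) per n-gram order, copied from the tables above.
original = {1: (362734, 56859890), 2: (6905615, 53549688), 3: (20496858, 50239484)}
reduced = {1: (9132, 56859890), 2: (368697, 40162258), 3: (229289, 15071838)}

entry_share = {}     # fraction of distinct n-grams that is kept
token_coverage = {}  # fraction of N that the kept n-grams account for
for n in (1, 2, 3):
    v_full, n_full = original[n]
    v_kept, n_kept = reduced[n]
    entry_share[n] = v_kept / v_full
    token_coverage[n] = n_kept / n_full
    print(f"{n}-grams: {entry_share[n]:.2%} of entries, {token_coverage[n]:.2%} of N")
```

The ratios of N come out at about 75% for bigrams and 30% for trigrams, which matches the stated coverage targets, while N for unigrams is unchanged because the left-out words are counted under “OTH”. In each case only a small fraction of the distinct entries is retained.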