Current ngrams from Corpora (all samples)

The large number of tokens and the large vocabulary for unigrams, bigrams and trigrams make it challenging to use this data in a mobile or web-based data product. A decision has to be made on how to reduce the n-grams to be used.

Summary Info About NGrams
ng  noOfEntries_V  N         memorySize
1   362734         56859890  25.1 Mb
2   6905615        53549688  520 Mb
3   20496858       50239484  1.6 Gb

Trigram Strategy

Create a data structure with the fields bigram(w_i-2, w_i-1), next.word(w_i), count.trigram(w_i-2, w_i-1, w_i) and count.bigram(w_i-2, w_i-1), and keep only the most frequent trigrams needed to reach a target coverage.

With 30% coverage, we keep 229289 of the 20496858 observed trigrams (about 1.1%).
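
The structure and cutoff described above can be sketched as follows. This is a minimal sketch in Python, not the code used for the analysis. It assumes that trigram counts are held in a dictionary keyed by word triples, that the most frequent trigrams are kept first, that coverage means the share of all trigram occurrences, and that count.bigram is obtained by summing the trigram counts sharing the same two-word prefix. The names `TrigramRow`, `reduce_trigrams` and `toy_counts` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Tuple


class TrigramRow(NamedTuple):
    bigram: Tuple[str, str]  # (w_i-2, w_i-1)
    next_word: str           # w_i
    count_trigram: int       # occurrences of (w_i-2, w_i-1, w_i)
    count_bigram: int        # occurrences of the prefix (w_i-2, w_i-1)


def reduce_trigrams(trigram_counts: Dict[Tuple[str, str, str], int],
                    coverage: float) -> List[TrigramRow]:
    """Keep the most frequent trigrams until `coverage` of all occurrences is reached."""
    total = sum(trigram_counts.values())

    # Assumption: prefix counts are computed on the full table, before the cutoff.
    prefix_counts = Counter()
    for (w1, w2, _), c in trigram_counts.items():
        prefix_counts[(w1, w2)] += c

    rows, covered = [], 0
    # Sort by descending count; ties are broken alphabetically for determinism.
    ordered = sorted(trigram_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for (w1, w2, w3), c in ordered:
        if covered >= coverage * total:
            break
        rows.append(TrigramRow((w1, w2), w3, c, prefix_counts[(w1, w2)]))
        covered += c
    return rows


# Toy data, invented for illustration only.
toy_counts = {
    ("i", "love", "you"): 5,
    ("i", "love", "it"): 3,
    ("thank", "you", "so"): 1,
    ("see", "you", "soon"): 1,
}
for row in reduce_trigrams(toy_counts, coverage=0.5):
    print(row)
```

Keeping count.bigram next to count.trigram means the relative frequency count.trigram / count.bigram can be computed from a single row, without a second lookup in the bigram table.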

Bigram Strategy

Limit the size of the bigram model so that the kept bigrams cover 75% of the observed bigram occurrences.

Unigram Strategy

Limit the size of the unigram model in order to have a 95% coverage. The left-out unigrams are replaced by the “OTH” unigram, whose count is the total count of the left-out unigrams (N does not change, V does). When looking up a unigram, if it is not found, then “OTH” is used.
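
The “OTH” replacement and the lookup fallback can be sketched as follows. This is a minimal sketch in Python under stated assumptions: unigram counts are held in a dictionary, the most frequent unigrams are kept first, and coverage is the share of all unigram occurrences. The names `reduce_unigrams`, `unigram_count` and `toy_unigrams` are hypothetical.

```python
from typing import Dict


def reduce_unigrams(unigram_counts: Dict[str, int], coverage: float,
                    other: str = "OTH") -> Dict[str, int]:
    """Keep the most frequent unigrams up to `coverage`; pool the rest under `other`."""
    total = sum(unigram_counts.values())
    kept, covered = {}, 0
    ordered = sorted(unigram_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for word, c in ordered:
        if covered >= coverage * total:
            break
        kept[word] = c
        covered += c
    left_out = total - covered
    if left_out > 0:
        # The pooled entry carries the count of everything left out, so N is unchanged.
        kept[other] = left_out
    return kept


def unigram_count(model: Dict[str, int], word: str, other: str = "OTH") -> int:
    """Return the count for `word`, falling back to the `other` entry if it is missing."""
    return model.get(word, model.get(other, 0))


# Toy data, invented for illustration only.
toy_unigrams = {"the": 50, "of": 30, "to": 16, "zyzzyva": 3, "qat": 1}
model = reduce_unigrams(toy_unigrams, coverage=0.95)
print(model)
print(unigram_count(model, "the"), unigram_count(model, "qat"))
```

In this toy case the two rarest words are pooled, so the vocabulary shrinks from five entries to four while the total count stays at 100.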

Summary Info About NGrams - Original Memory Footprint
ng  noOfEntries_V  N         memorySize
1   362734         56859890  25.1 Mb
2   6905615        53549688  520 Mb
3   20496858       50239484  1.6 Gb
Summary Info About NGrams - Reduced Memory Footprint
ng  noOfEntries_V  N         memorySize
1   9132           56859890  1.1 Mb
2   368697         40162258  26.8 Mb
3   229289         15071838  10.2 Mb
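
As a sanity check, the share of entries and of occurrences kept for each n-gram order can be recomputed from the two tables. The short Python sketch below uses only the numbers reported above and assumes that noOfEntries_V is the number of distinct n-grams and N the total number of occurrences. The names `original`, `reduced` and `kept_shares` are hypothetical.

```python
# (noOfEntries_V, N) per n-gram order, copied from the two summary tables.
original = {1: (362734, 56859890), 2: (6905615, 53549688), 3: (20496858, 50239484)}
reduced = {1: (9132, 56859890), 2: (368697, 40162258), 3: (229289, 15071838)}


def kept_shares(original, reduced):
    """Return {ng: (share of entries kept, share of occurrences kept)}."""
    shares = {}
    for ng, (v0, n0) in original.items():
        v1, n1 = reduced[ng]
        shares[ng] = (v1 / v0, n1 / n0)
    return shares


shares = kept_shares(original, reduced)
for ng, (v_share, n_share) in shares.items():
    print(f"ng={ng}: entries kept {v_share:.1%}, occurrences kept {n_share:.1%}")
```

The occurrence shares for bigrams and trigrams come out at about 75% and 30%, matching the stated coverage targets, while the unigram share stays at 100% because the left-out counts are pooled under “OTH”. The entry shares are roughly 2.5%, 5.3% and 1.1% respectively.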