Current ngrams from Corpora (all samples)

The large number of tokens and the large vocabulary for unigrams, bigrams and trigrams make it challenging to use this data in a mobile or web-based data product. A decision is needed on how to reduce the n-grams that are kept.

Summary Info About NGrams
ng	noOfEntries_V	N	memorySize
1	362734	56859890	25.1 Mb
2	6905615	53549688	520 Mb
3	20496858	50239484	1.6 Gb

Trigram Strategy

Create a data structure with the columns bigram (w_i-2, w_i-1), next.word (w_i), count.trigram (w_i-2, w_i-1, w_i) and count.bigram (w_i-2, w_i-1), then keep only as many of the most frequent trigrams as are needed to reach a target coverage.
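A minimal sketch of such a table, in Python for illustration only. The function name `build_trigram_table` and the toy sentence are assumptions, not part of the original work; the row layout follows the columns listed above.

```python
from collections import Counter

def build_trigram_table(tokens):
    """Return rows of (bigram, next_word, count_trigram, count_bigram).

    Hypothetical helper: the row layout mirrors the columns named on the
    slide, but the implementation is only an illustration.
    """
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    rows = []
    for (w1, w2, w3), c3 in trigram_counts.items():
        rows.append(((w1, w2), w3, c3, bigram_counts[(w1, w2)]))
    # Most frequent trigrams first; ties broken alphabetically so the
    # order is deterministic.
    rows.sort(key=lambda r: (-r[2], r[0], r[1]))
    return rows

tokens = "i like tea i like tea i like coffee".split()
table = build_trigram_table(tokens)
for bigram, next_word, c3, c2 in table:
    # c3 / c2 is the maximum-likelihood estimate of P(next_word | bigram)
    print(bigram, next_word, c3, c2, round(c3 / c2, 3))
```

Storing count.bigram next to count.trigram means the ratio of the two can be computed per row without a second lookup.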

With 30% coverage, the trigrams are limited to 229289 of the 20496858 observed trigrams.
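The cut itself could be sketched as follows. This assumes that "coverage" means the share of all observed occurrences (N) accounted for by the kept entries, and that entries are kept in order of decreasing count. The function name `truncate_to_coverage` and the toy counts are hypothetical.

```python
def truncate_to_coverage(counts, coverage):
    """Keep the most frequent entries until they reach `coverage` of all occurrences.

    Assumption: coverage = (sum of kept counts) / (sum of all counts).
    Returns the kept entries and the coverage actually achieved.
    """
    total = sum(counts.values())
    target = coverage * total
    kept = {}
    running = 0
    # Highest counts first; ties broken by key for a deterministic result.
    for ngram, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if running >= target:
            break
        kept[ngram] = count
        running += count
    return kept, running / total

counts = {"a b c": 50, "a b d": 30, "b c d": 10, "c d e": 5, "d e f": 5}
kept_30, achieved_30 = truncate_to_coverage(counts, 0.30)
kept_75, achieved_75 = truncate_to_coverage(counts, 0.75)
```

Because whole entries are kept, the achieved coverage can overshoot the target, as it does in this toy example.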

Bigram Strategy

Limit the size of the bigram model to reach 75% coverage of the observed bigram occurrences.

Unigram Strategy

Limit the size of the unigram model to reach 95% coverage. The left-out unigrams are replaced by a single “OTH” unigram whose count is the sum of the left-out counts (N does not change, V does). When a unigram is looked up and not found, “OTH” is used instead.
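A small sketch of the “OTH” replacement, assuming the left-out counts are summed into one entry so that N is preserved. The helper names `reduce_unigrams` and `unigram_count` and the toy counts are illustrative assumptions.

```python
def reduce_unigrams(counts, coverage, other="OTH"):
    """Keep the most frequent unigrams up to `coverage`; pool the rest in `other`."""
    total = sum(counts.values())
    target = coverage * total
    kept = {}
    running = 0
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if running >= target:
            break
        kept[word] = count
        running += count
    left_out = total - running
    if left_out > 0:
        # One pooled entry carries the mass of every dropped unigram,
        # so the total count N is unchanged while V shrinks.
        kept[other] = left_out
    return kept

def unigram_count(model, word, other="OTH"):
    """Look up a unigram, falling back to the pooled entry when it is missing."""
    return model.get(word, model.get(other, 0))

counts = {"the": 60, "of": 25, "tea": 11, "zyzzyva": 2, "qat": 2}
model = reduce_unigrams(counts, 0.95)
```

In this sketch any unseen word, whether dropped during reduction or never observed at all, receives the pooled “OTH” count.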

Summary Info About NGrams - Original Memory Footprint
ng	noOfEntries_V	N	memorySize
1	362734	56859890	25.1 Mb
2	6905615	53549688	520 Mb
3	20496858	50239484	1.6 Gb

Summary Info About NGrams - Reduced Memory Footprint
ng	noOfEntries_V	N	memorySize
1	9132	56859890	1.1 Mb
2	368697	40162258	26.8 Mb
3	229289	15071838	10.2 Mb
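As a quick arithmetic check, the two tables can be compared directly. The figures below are copied from the tables above; the function name `coverage_summary` is an assumption made for this sketch.

```python
# (noOfEntries_V, N) per n-gram order, copied from the two summary tables.
original = {1: (362734, 56859890), 2: (6905615, 53549688), 3: (20496858, 50239484)}
reduced = {1: (9132, 56859890), 2: (368697, 40162258), 3: (229289, 15071838)}

def coverage_summary(original, reduced):
    """Share of distinct entries and of occurrences kept for each n-gram order."""
    summary = {}
    for n in original:
        v_orig, n_orig = original[n]
        v_red, n_red = reduced[n]
        summary[n] = {
            "entries_kept": v_red / v_orig,
            "occurrences_kept": n_red / n_orig,
        }
    return summary

summary = coverage_summary(original, reduced)
for n, row in summary.items():
    print(n, f"{row['entries_kept']:.2%}", f"{row['occurrences_kept']:.2%}")
```

The ratios of N come out at about 30% for trigrams and 75% for bigrams, matching the stated coverage targets, while the unigram N is unchanged because the left-out mass moves into “OTH”. Roughly 1.1% of the distinct trigrams account for that 30%.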

Reducing ngrams (memory footprint)

Pier Lorenzo Paracchini

28 May 2016
