Mahesh Divakaran
2022-08-22
Get 2-grams and 3-grams (with stopwords).
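As a rough illustration of this step, the sketch below counts 2-grams and 3-grams from a token list without removing stopwords. It is written in Python purely for illustration and is not the code behind the app; the helper name `ngrams` and the sample sentence are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Toy sample text; stopwords such as "to" and "or" are deliberately kept.
tokens = "to be or not to be".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(1))
print(len(trigram_counts))
```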
To reduce the N-gram dictionary size, first calculate the frequency of each N-gram, then drop the least frequent ones (the long tail): for example, the N-grams that together account for only the last 10% of occurrences, or those that appear only once in the text corpus.
E.g. the total count of distinct 1-grams is around 540,000, yet only about 6,000 words are needed to cover 90% of the occurrences.
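The two pruning rules described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the app's own code: the function names `prune_by_coverage` and `drop_singletons`, the toy corpus, and the choice to stop as soon as the running total reaches the coverage target are all assumptions.

```python
from collections import Counter

def prune_by_coverage(counts, coverage=0.9):
    """Keep the most frequent n-grams until they cover `coverage` of all occurrences."""
    total = sum(counts.values())
    kept = {}
    running = 0
    for ngram, freq in counts.most_common():  # most frequent first
        if running >= coverage * total:
            break  # target reached; everything left is the long tail
        kept[ngram] = freq
        running += freq
    return kept

def drop_singletons(counts):
    """Discard n-grams that appear only once in the corpus."""
    return {ngram: freq for ngram, freq in counts.items() if freq > 1}

# Toy corpus: 10 tokens, 7 distinct words.
unigrams = Counter("the cat sat on the mat and the cat ran".split())

print(prune_by_coverage(unigrams, coverage=0.5))
print(drop_singletons(unigrams))
```

On this toy corpus both rules keep only "the" and "cat". On a real corpus the two rules generally give different dictionaries, so the threshold is a trade-off between dictionary size and coverage.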
The Shiny app uses a 3-gram dictionary (omitting 3-grams that appear only once in the text corpus). It matches the last two words of the input against the first two words of the dictionary entries to predict the third word. If no entries are found, it matches only the last word of the input instead. If there is still no match, it returns the most frequent 3-grams as the result.
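The fallback logic described above can be sketched in Python as below. This is an illustration only, not the Shiny app's R code. Several details are assumptions because the text does not specify them: the function name `predict_next`, the `top_n` parameter, comparing the single last word with the second word of each entry, and returning the third words of the most frequent 3-grams in the final fallback.

```python
from collections import Counter

def predict_next(text, trigram_counts, top_n=3):
    """Suggest next words using a 3-gram dictionary with two fallback steps."""
    words = text.lower().split()
    ranked = [t for t, _ in trigram_counts.most_common()]  # most frequent first

    # Step 1: last two input words vs. the first two words of each entry.
    if len(words) >= 2:
        hits = [t[2] for t in ranked if t[:2] == (words[-2], words[-1])]
        if hits:
            return hits[:top_n]

    # Step 2 (assumption): last input word vs. the second word of each entry.
    if words:
        hits = [t[2] for t in ranked if t[1] == words[-1]]
        if hits:
            return list(dict.fromkeys(hits))[:top_n]  # de-duplicate, keep order

    # Step 3 (assumption): third words of the overall most frequent 3-grams.
    return list(dict.fromkeys(t[2] for t in ranked))[:top_n]

# Hypothetical toy dictionary with made-up counts.
trigrams = Counter({
    ("of", "the", "year"): 7,
    ("i", "love", "you"): 5,
    ("i", "love", "it"): 3,
    ("we", "love", "pizza"): 2,
})

print(predict_next("I love", trigrams))
print(predict_next("They love", trigrams))
print(predict_next("hello", trigrams))
```

Scanning the whole dictionary on every call is fine for a toy example; with millions of entries, an index keyed on the first two words would be the more practical design.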
You can launch the app:
online at https://datascience9.shinyapps.io/capstone/ or locally by running the following code in RStudio
There are around 540,000 1-grams (distinct words) in total. The no-tail 3-gram dictionary has about 4,060,000 entries, and the number of unique first words among those 3-grams is around 54,000.
Hence 53766/537782 ≈ 10% of the distinct words in the text corpus are covered. The sum of the 1-gram occurrences is 68064165. The sum of the occurrences covered by the 3-gram first words is 66520911.
Hence 66520911/68064165 = 97.73% of word occurrences are covered.
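The two coverage ratios can be re-derived from the counts quoted above. The Python snippet below is only an arithmetic check; the variable names are assumptions.

```python
# Counts quoted in the text.
unique_first_words = 53_766       # distinct first words of the no-tail 3-grams
total_unigrams = 537_782          # distinct words (1-grams) in the corpus
covered_occurrences = 66_520_911  # occurrences of words that start some 3-gram
total_occurrences = 68_064_165    # sum of all 1-gram occurrences

vocab_share = unique_first_words / total_unigrams
occurrence_share = covered_occurrences / total_occurrences

print(f"share of distinct words covered:   {vocab_share:.2%}")
print(f"share of word occurrences covered: {occurrence_share:.2%}")
```

So roughly a tenth of the vocabulary accounts for nearly all word occurrences, which is why dropping the long tail costs little coverage.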