Wenjing Liu
2019-04-23
Alternatively, we could use libraries to tokenize the text (omitting stopwords). For Twitter text we could use the function tokenize_tweets().
library(tokenizers)
library(stopwords)
tokenize_words(text, stopwords = stopwords::stopwords("en"))   # `text` is a character vector of documents
For more info: https://cran.r-project.org/web/packages/tokenizers/vignettes/introduction-to-tokenizers.html
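As a rough illustration of tokenize_tweets() (provided by the tokenizers package at the time of writing; the sample tweet below is made up), it keeps hashtags and @-handles as single tokens, which tokenize_words() would split apart:
library(tokenizers)
library(stopwords)
tweet <- "Loving the #DataScience capstone, thanks @coursera!"   # made-up sample tweet
tokenize_tweets(tweet, stopwords = stopwords::stopwords("en"))
# roughly: "loving" "#datascience" "capstone" "thanks" "@coursera"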
Get 2-grams and 3-grams (this time keeping stopwords).
tokenize_ngrams(text, n_min = 2, n = 3)   # `text` as above
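For example (the toy sentence below is made up for illustration), the call returns all 2-grams and 3-grams, stopwords included:
library(tokenizers)
sentence <- "thanks for the follow"   # made-up example sentence
tokenize_ngrams(sentence, n_min = 2, n = 3)
# roughly: "thanks for" "for the" "the follow" "thanks for the" "for the follow"
# (exact ordering may differ)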
To reduce the N-gram dictionary size, first calculate the frequency of each N-gram, then drop the least frequent ones (the long tail), say those that together account for only the last 10% of occurrences, or those that appear only once in the text corpus.
E.g. the total count of 1-grams is around 540,000, yet only about 6,000 words are needed to cover 90% of the occurrences.
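A minimal sketch of that cut-off, assuming the raw text is already loaded into a character vector named corpus (the variable names and the 90% threshold here are just for illustration):
library(tokenizers)
tokens <- unlist(tokenize_words(corpus))            # flatten all documents into one token vector
freq   <- sort(table(tokens), decreasing = TRUE)    # 1-gram frequencies, most frequent first
coverage   <- cumsum(freq) / sum(freq)              # cumulative share of occurrences
dictionary <- names(freq)[coverage <= 0.9]          # keep the head that covers 90% of occurrences
# alternatively, drop only the singletons: names(freq)[freq > 1]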
Use Twitter text as an example.
The Shiny app uses a 3-gram dictionary (omitting 3-grams that appear only once in the text corpus). It matches the last two words of the input against the first two words of the dictionary entries to predict the third word. If no entry is found, it falls back to matching only the last word of the input. If there is still no match, it returns the most frequent 3-grams overall.
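Below is a minimal sketch of that back-off look-up, not the app's actual code; it assumes the 3-gram dictionary is a data frame named ngram3 with columns w1, w2, w3 and rows sorted by frequency (all names here are illustrative):
library(tokenizers)
predict_word <- function(input, ngram3, k = 3) {
  words <- unlist(tokenize_words(input))
  n <- length(words)                                  # assumes the input has at least two words
  # 1) match the last two words of the input against the first two words of the 3-grams
  hits <- ngram3[ngram3$w1 == words[n - 1] & ngram3$w2 == words[n], ]
  # 2) back off: match only the last word of the input
  if (nrow(hits) == 0) hits <- ngram3[ngram3$w2 == words[n], ]
  # 3) back off again: fall back to the most frequent 3-grams overall
  if (nrow(hits) == 0) hits <- ngram3
  head(hits$w3, k)                                    # top candidate third words
}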
You can launch the app:
library(shiny)
runGitHub("Shiny-Text_Input_Prediction-V2", "Nov05")
There are around 54,000 1-grams (distinct words) in total. The no-tail 3-gram dictionary has about 4,060,000 entries, and the count of unique first words among the 3-grams is around 540,000.
The sum of the 1-gram occurrences is 68,064,165. The sum of those covered by the 3-gram first words is .
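A sketch of how those two sums could be computed, reusing the freq table and ngram3 data frame from the sketches above (both names are assumptions, not the project's actual objects):
total_1gram <- sum(freq)                                       # total 1-gram occurrences
covered     <- sum(freq[names(freq) %in% unique(ngram3$w1)])   # occurrences of words that start a kept 3-gram
covered / total_1gram                                          # share of occurrences covered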