Mohamad Raziff bin Ramli
April 2016
The Capstone dataset comprises Twitter, news, and blog text from HC Corpora. The data were cleansed, sampled, and subset before being gathered into R data frames. Using Text Mining (tm) and NLP techniques, sets of word combinations (n-grams) were created. These n-grams are the main input to the Katz backoff algorithm, which predicts the next word. Some adaptations and heuristics were developed specifically to enhance this Shiny application.
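A minimal sketch of the cleansing and n-gram tokenization step, assuming the tm and RWeka packages (the file name, 5% sampling rate, and object names are illustrative, not the app's actual code):

```r
library(tm)
library(RWeka)

# Sample the raw text and build a tm corpus (file name and rate are assumptions)
raw  <- readLines("en_US.twitter.txt", skipNul = TRUE, encoding = "UTF-8")
samp <- sample(raw, floor(length(raw) * 0.05))
corp <- VCorpus(VectorSource(samp))

# Cleansing: lower-case, then drop punctuation, numbers and extra whitespace
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, stripWhitespace)

# Tokenize into bigrams and tabulate frequencies and relative probabilities
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm  <- TermDocumentMatrix(corp, control = list(tokenize = bigramTokenizer))
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
bigrams <- data.frame(Word = names(freq), Freq = freq,
                      Prob = freq / sum(freq),
                      row.names = NULL, stringsAsFactors = FALSE)
```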
Just type a word, phrase, or sentence. The app shows what the user entered, followed by its cleansed form. As the main result, up to the five most probable n-gram predictions are displayed in a list control. The user can revise or replace the input, and the app will respond with fresh predictions. Another tab offers more extensive documentation.
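The lookup behind the suggestions could look like the following simplified backoff sketch (it falls back from trigrams to bigrams by raw probability and omits Katz's discounting step; the function name and logic are illustrative, not the app's actual code):

```r
# Given cleansed input, try trigram matches first, then back off to bigrams.
# 'bigrams'/'trigrams' are frequency data frames like those shown below.
predictNext <- function(input, bigrams, trigrams, n = 5) {
  words <- tail(strsplit(tolower(input), "\\s+")[[1]], 2)
  hits  <- character(0)
  if (length(words) == 2) {
    prefix <- paste(words, collapse = " ")
    m    <- trigrams[startsWith(trigrams$Word, paste0(prefix, " ")), ]
    hits <- head(m$Word[order(-m$Prob)], n)
  }
  if (length(hits) < n && length(words) >= 1) {
    m    <- bigrams[startsWith(bigrams$Word, paste0(tail(words, 1), " ")), ]
    hits <- unique(c(hits, head(m$Word[order(-m$Prob)], n)))
  }
  head(hits, n)
}
```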
The first five rows of the "bigrams" and "trigrams" data frames loaded by the Shiny app are shown below.
Bigram | Freq | Prob |
---|---|---|
in the | 26169 | 0.00267243534440501 |
for the | 24647 | 0.00251700538551532 |
of the | 19001 | 0.00194042355378653 |
on the | 15965 | 0.00163038061345202 |
to be | 15648 | 0.00159800788219839 |
Trigram | Freq | Prob |
---|---|---|
thanks for the | 7830 | 0.000799616674182859 |
looking forward to | 2863 | 0.000292375803088828 |
cant wait to | 2835 | 0.000289516382031725 |
thank you for | 2812 | 0.000287167571877676 |
i love you | 2770 | 0.00028287844029202 |
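For example, querying the trigram frame above for completions of the prefix "thanks for" surfaces the top row directly:

```r
# Top completions for the prefix "thanks for", straight from the trigram frame
cand <- trigrams[startsWith(trigrams$Word, "thanks for "), ]
head(cand[order(-cand$Prob), "Word"], 5)
# "thanks for the" ranks first (Freq 7830 in the table above)
```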