Predicting Next Word - Capstone Project

Sergio Vicente (@svicente99)
January 2016


Data Science Specialization

Coursera.org

Summary

This report gives a high-level description of the Capstone Project, the final product of the Coursera Data Science Specialization.
The objective of the app is to implement a predictive model that suggests one or more next words consistent with the sentence the user has typed. The Capstone dataset comprises Twitter, news and blog text from HC Corpora. After data cleansing, sampling and sub-setting, all data are gathered in R data frames. Text Mining (TM) and NLP techniques are then applied to build sets of word combinations (N-grams). These N-grams are the main input to the backoff algorithm, derived from Katz Backoff, that predicts the next word. Some adaptations and heuristics were developed specifically to enhance this Shiny application.

How the app works

Just type a word, phrase or sentence. The app shows what the user has entered, followed by its cleansed form. As the main result, up to five of the most probable n-gram predictions are displayed in a list control. The user can revise or replace the input, and the app will then present new predictions. Another tab offers more extensive documentation.

Main steps to achieve next word(s) predictions:

  1. Loading 4 previously generated data frames containing n-gram combinations of 4 words, 3 words, 2 words, and 1 word.
  2. Reading the user input (a word or sentence).
  3. Cleansing the user input (lowercasing, removing profanities, tokenizing, and keeping the last four words).
  4. Calling the prediction function, essentially the Stupid Backoff algorithm (a simplified approach to Katz Backoff):
    • search the fourgram data frame; if matches are found, show the top 5 most probable. Otherwise,
    • search the trigram data frame in the same way. Otherwise,
    • search the bigram data frame in the same way.
    • Finally, if nothing matches, display the most frequent words in the unigram data frame.
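
The steps above can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not the app's actual R code: the names `clean_input`, `predict_next`, `PROFANITIES` and `TABLES` are hypothetical, the profanity list is a placeholder, and the fourgram and unigram counts are made up for the example. It follows the cascade exactly as described (use the longest n-gram table that has a match) and does not apply the score discounting that a full Stupid Backoff implementation would.

```python
import re

# Placeholder profanity list (assumption; the app's real list is not shown here).
PROFANITIES = {"darn"}

# Toy n-gram tables: n -> {tuple of words: frequency}.
# Fourgram and unigram counts are invented for illustration only.
TABLES = {
    4: {("thanks", "for", "the", "follow"): 10},
    3: {("thanks", "for", "the"): 7830, ("cant", "wait", "to"): 2835},
    2: {("in", "the"): 26169, ("for", "the"): 24647, ("to", "be"): 15648},
    1: {("the",): 100, ("to",): 80, ("and",): 60},
}


def clean_input(text, keep=4):
    """Lowercase, strip non-letters, drop profanities, keep the last `keep` words."""
    text = re.sub(r"[^a-z\s]", "", text.lower())  # "can't" becomes "cant"
    tokens = [t for t in text.split() if t not in PROFANITIES]
    return tokens[-keep:]


def predict_next(tokens, tables, top=5):
    """Try fourgrams, then trigrams, then bigrams; fall back to top unigrams."""
    for n in (4, 3, 2):
        if len(tokens) < n - 1:
            continue  # not enough context for this n-gram order
        prefix = tuple(tokens[-(n - 1):])
        # Collect the last word of every n-gram whose first n-1 words match.
        matches = {ng[-1]: f for ng, f in tables[n].items() if ng[:-1] == prefix}
        if matches:
            return sorted(matches, key=matches.get, reverse=True)[:top]
    # Nothing matched: return the most frequent single words.
    unigrams = tables[1]
    return [ng[0] for ng in sorted(unigrams, key=unigrams.get, reverse=True)[:top]]


print(predict_next(clean_input("Thanks for"), TABLES))    # ['the']
print(predict_next(clean_input("I can't wait"), TABLES))  # ['to']
print(predict_next(clean_input("zebra"), TABLES))         # ['the', 'to', 'and']
```

In the last call no table contains an n-gram starting with "zebra", so the sketch falls through to the most frequent unigrams, mirroring the final bullet of step 4.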

N-grams excerpts

Below are 5 rows from each of the “bigrams” and “trigrams” data frames loaded by the Shiny app.

Bigrams:

Word      Freq    Prob
in the    26169   0.00267243534440501
for the   24647   0.00251700538551532
of the    19001   0.00194042355378653
on the    15965   0.00163038061345202
to be     15648   0.00159800788219839

Trigrams:

Word                 Freq   Prob
thanks for the       7830   0.000799616674182859
looking forward to   2863   0.000292375803088828
cant wait to         2835   0.000289516382031725
thank you for        2812   0.000287167571877676
i love you           2770   0.00028287844029202
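
The report does not say how the Prob column is computed. Judging only from the rows excerpted above, it appears to be Freq divided by a single total shared by both tables (roughly 9.8 million). The short Python check below is a sketch of that inference, using a subset of the excerpted rows; the variable names are illustrative and the interpretation is an assumption, not a documented fact about the app.

```python
# (n-gram, Freq, Prob) rows copied from the bigram and trigram excerpts.
rows = [
    ("in the", 26169, 0.00267243534440501),
    ("for the", 24647, 0.00251700538551532),
    ("of the", 19001, 0.00194042355378653),
    ("thanks for the", 7830, 0.000799616674182859),
    ("i love you", 2770, 0.00028287844029202),
]

# If Prob = Freq / total, then Freq / Prob gives the same total on every row.
totals = [freq / prob for _, freq, prob in rows]

# Relative spread between the largest and smallest implied total.
spread = (max(totals) - min(totals)) / min(totals)

print(round(totals[0]), spread)
```

A spread close to zero means every row implies the same denominator, which is consistent with both tables being normalized by one common count rather than by per-prefix counts.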

Viewing the Shiny App

Shiny App - Next Word Prediction