Predicting Next Word - Capstone Project

Sergio Vicente (@svicente99)
January 2016

Data Science Specialization

Coursera.org

Summary

This report gives a high-level description of the Capstone Project, the final product of the Coursera Data Science Specialization.

The objective of the app is to implement a predictive model that suggests one or more words coherent with the sentence the user has typed. The Capstone dataset includes Twitter, news and blog texts from HC Corpora. After data cleansing, sampling and sub-setting, we gather all data in R data frames. Applying Text Mining (TM) and NLP techniques, we create sets of word combinations (N-grams). These N-grams are the main input to the backoff algorithm (a simplified variant of Katz Backoff) that predicts the next word. Some adaptations and heuristics were developed specifically to enhance this Shiny application.

How the app works

Just type a word, phrase or sentence. The app shows what the user has entered, followed by its cleansed form. As the main result, up to five of the most probable n-gram predictions are displayed in a list control. The user can revise or replace the input, and the app will then present new predictions. Another tab offers more extensive documentation.

Main steps to predict the next word(s):

Load 4 previously generated data frames containing n-grams of 4 words, 3 words, 2 words, and 1 word.
Read the user input (a word or sentence).
Cleanse the user input (lowercasing, profanity removal, tokenization of the input words, keeping the last four).
Call the prediction model function, which is essentially the Stupid Backoff algorithm (a simplified alternative to Katz Backoff):

search the fourgram data frame; if matches are found, show the top 5 most probable ones. Otherwise,
search the trigram data frame in the same way. Otherwise,
search the bigram data frame in the same way.
Finally, if nothing matches, display the most frequent words in the unigram data frame.
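
The steps above can be sketched in a few lines. This is only an illustration in Python, not the app's actual R code: the function names clean_input and predict_next, the one-word profanity list, and the unigram values are all hypothetical, while the bigram and trigram rows are taken from the excerpts in this report. The sketch follows the cascade exactly as listed (first table with a match wins) and does not apply the discount factor that a full Stupid Backoff scoring would use.

```python
import re

# Hypothetical placeholder; the app's real profanity list is not given here.
PROFANITIES = {"darn"}

def clean_input(text, keep_last=4):
    """Lowercase, tokenize, drop apostrophes and profanities, keep the last tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t.replace("'", "") for t in tokens]
    tokens = [t for t in tokens if t and t not in PROFANITIES]
    return tokens[-keep_last:]

def predict_next(tokens, tables, top_n=5):
    """tables maps n-gram order (4, 3, 2, 1) to a dict {n-gram: probability}."""
    for order in (4, 3, 2):
        context = tokens[-(order - 1):]
        if len(context) < order - 1:
            continue  # not enough words typed for this n-gram order
        prefix = " ".join(context) + " "
        matches = [(p, g) for g, p in tables.get(order, {}).items()
                   if g.startswith(prefix)]
        if matches:
            # Most probable first; ties broken alphabetically.
            matches.sort(key=lambda m: (-m[0], m[1]))
            return [g.split()[-1] for _, g in matches[:top_n]]
    # Nothing matched: fall back to the most frequent single words.
    unigrams = sorted(tables.get(1, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in unigrams[:top_n]]

TABLES = {
    3: {"thanks for the": 0.000799616674182859,
        "looking forward to": 0.000292375803088828,
        "cant wait to": 0.000289516382031725,
        "thank you for": 0.000287167571877676,
        "i love you": 0.00028287844029202},
    2: {"in the": 0.00267243534440501,
        "for the": 0.00251700538551532,
        "of the": 0.00194042355378653,
        "on the": 0.00163038061345202,
        "to be": 0.00159800788219839},
    1: {"the": 0.05, "to": 0.03},  # made-up values, for illustration only
}

print(predict_next(clean_input("Thanks for"), TABLES))  # ['the']
print(predict_next(clean_input("I want to"), TABLES))   # ['be']
print(predict_next(clean_input("hello"), TABLES))       # ['the', 'to']
```

In the second call the trigram table has no entry starting with "want to", so the lookup backs off to the bigram table and matches "to be". The third call matches nothing and falls through to the unigram table.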

N-grams excerpts

Below are 5 rows from each of the “bigrams” and “trigrams” data frames that are loaded by the Shiny App.

Word	Freq	Prob
in the	26169	0.00267243534440501
for the	24647	0.00251700538551532
of the	19001	0.00194042355378653
on the	15965	0.00163038061345202
to be	15648	0.00159800788219839

Word	Freq	Prob
thanks for the	7830	0.000799616674182859
looking forward to	2863	0.000292375803088828
cant wait to	2835	0.000289516382031725
thank you for	2812	0.000287167571877676
i love you	2770	0.00028287844029202
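
A table with this layout could be produced roughly as follows. This is a minimal Python sketch, not the code used for the project (which is in R): the function name ngram_table and the three sample sentences are invented for illustration, and it assumes that Prob is simply Freq divided by the total number of n-grams counted, which this report does not state explicitly.

```python
from collections import Counter

def ngram_table(sentences, n, top=5):
    """Return the top rows (ngram, freq, prob) for n-grams of size n."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    total = sum(counts.values())
    # Assumption: probability is the relative frequency within this table.
    rows = [(gram, freq, freq / total) for gram, freq in counts.items()]
    rows.sort(key=lambda r: (-r[1], r[0]))  # most frequent first
    return rows[:top]

# Invented sample text; the real app uses the HC Corpora data.
SAMPLE = [
    "thanks for the follow",
    "thanks for the help",
    "looking forward to it",
]

for gram, freq, prob in ngram_table(SAMPLE, 3):
    print(gram, freq, prob, sep="\t")
```

On real data the counting would run over the cleansed and sampled corpus, and the resulting tables would be saved so that the app only has to load them.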

Viewing the Shiny App

Shiny App - Next Word Prediction

Link to Shiny App ←← Use it!
GitHub repository with the code for this application
Possible improvements…