Predicting Next Word Application - Capstone Final Project

Mohamad Raziff bin Ramli
April 2016


Data Science Specialization

Overview

The goal of this exercise is to create a product that showcases the prediction algorithm that was built and to provide an interface that others can access. To fulfil this requirement, a Shiny app was developed that takes a phrase (multiple words) in a text box as input and outputs a prediction of the next word. The development of the Capstone Project is described as follows:

The Capstone dataset includes Twitter, news, and blog text from HC Corpora. The data were cleansed, sampled, and subset before being gathered into R data frames. Text Mining (TM) and NLP techniques were then applied to create sets of word combinations (n-grams). These n-grams are the main support for the Katz Backoff algorithm that predicts the next word. Some adaptations and heuristics were developed specifically to enhance this Shiny application.
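As an illustration of the n-gram step, here is a minimal sketch in Python (not the R code used by the app). The function names `tokenize` and `ngram_table` and the three sample lines are hypothetical, and the sketch assumes that the probability of an n-gram is simply its count divided by the total count of all n-grams of that size.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text, drop everything except letters and spaces, split on whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower()).split()


def ngram_table(lines, n):
    """Count every n-word sequence and return rows of (ngram, freq, prob)."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    total = sum(counts.values())
    # most_common() orders rows by descending frequency;
    # prob is the relative frequency (an assumption of this sketch).
    return [(gram, freq, freq / total) for gram, freq in counts.most_common()]


# Hypothetical sample lines standing in for the cleaned corpus.
sample = ["Thanks for the follow!", "thanks for the help", "I can't wait to go"]
bigrams = ngram_table(sample, 2)
trigrams = ngram_table(sample, 3)
```

Each returned row has the same three columns as the data frames loaded by the app (word sequence, frequency, probability).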

How this app works

Just type a word, phrase, or sentence. The app shows what the user has entered, followed by its cleansed form. As the main result, up to five of the most probable n-gram predictions are displayed in a list control. The user can review or change the input, and the app will respond with new predictions. Another tab offers more extensive documentation.

Main steps to achieve next word(s) predictions:

  1. Loading the 4 data frames containing the previously generated n-grams (4-word, 3-word, 2-word, and 1-word combinations).
  2. Reading the user input (a word or sentence).
  3. Cleansing the user input (lowercasing, profanity removal, and tokenization, keeping the last four words).
  4. Calling the prediction model function, which is essentially the Stupid Backoff algorithm (a simplified alternative to Katz Backoff):
    • search the fourgram data frame; if matches are found, show the top 5 most probable ones. Otherwise,
    • search the trigram data frame in the same way. Otherwise,
    • search the bigram data frame in the same way.
    • Finally, if nothing matches, display the most frequent words in the unigram data frame.
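
The steps above can be sketched roughly as follows, in Python rather than the app's R code. Everything here is hypothetical: the toy lookup tables, the placeholder profanity list, and the function names `clean`, `top`, and `predict`. The sketch assumes that a fourgram lookup uses the three preceding words as context (two for trigrams, one for bigrams), and it implements only the first-match cascade described above, without any score discounting between levels.

```python
import re

# Hypothetical toy tables: context (tuple of preceding words) -> {next word: frequency}.
FOURGRAMS = {("thanks", "for", "the"): {"follow": 5, "help": 2}}
TRIGRAMS = {("for", "the"): {"first": 9, "follow": 5}}
BIGRAMS = {("the",): {"same": 12, "first": 9}}
UNIGRAMS = {"the": 100, "to": 80, "and": 70, "a": 60, "of": 50, "in": 40}
PROFANITY = {"darn"}  # placeholder word list, an assumption of this sketch


def clean(text):
    """Lowercase, strip non-letters, tokenize, and drop listed profanities."""
    tokens = re.sub(r"[^a-z\s]", "", text.lower()).split()
    return [t for t in tokens if t not in PROFANITY]


def top(freqs, k):
    """Return up to k keys ordered by descending frequency."""
    return sorted(freqs, key=freqs.get, reverse=True)[:k]


def predict(text, k=5):
    """Try the longest context first and back off to shorter ones."""
    tokens = clean(text)
    for table, size in ((FOURGRAMS, 3), (TRIGRAMS, 2), (BIGRAMS, 1)):
        context = tuple(tokens[-size:])
        if len(context) == size and context in table:
            return top(table[context], k)
    # No n-gram matched: fall back to the most frequent single words.
    return top(UNIGRAMS, k)
```

With these toy tables, `predict("Thanks for the")` is answered from the fourgram table, `predict("waiting for the")` backs off to the trigram table, and an unknown word such as `predict("xyzzy")` falls through to the unigram list.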

Effect on N-grams

Below are the first 5 rows of the “bigrams” and “trigrams” data frames, which are loaded by the Shiny app.

Bigrams:

Word                 Freq    Prob
in the               26169   0.00267243534440501
for the              24647   0.00251700538551532
of the               19001   0.00194042355378653
on the               15965   0.00163038061345202
to be                15648   0.00159800788219839

Trigrams:

Word                 Freq    Prob
thanks for the       7830    0.000799616674182859
looking forward to   2863    0.000292375803088828
cant wait to         2835    0.000289516382031725
thank you for        2812    0.000287167571877676
i love you           2770    0.00028287844029202

Viewing the Shiny App

Shiny App - Next Word Prediction