Next.Word - a smart next-word predictor

manifits
15 August 2017

Welcome to Next.Word

Next.Word is a Shiny web application that, based on user input, presents five next-word candidates (the default; user selectable from 1 to 10).

This is a Capstone project for the Coursera Johns Hopkins Data Science specialization. The goal of the project is to develop a predictive text model using a large, unstructured corpus of English text.

How to Use the App

The link to the app is here: Next.Word.

The Main tab has the following sections:

  • A slider input to select the number of next-word predictions to display; the default is 5.
  • An input box for users to key in a phrase. The input phrase is converted to lowercase, and punctuation marks, numbers, and leading, trailing and extra whitespace are removed before being fed to the prediction algorithm.
  • A next-word candidate table that lists the predicted next words and their corresponding scores.
  • An execution time section that reports the time taken by the prediction algorithm. (A minimal UI sketch follows this list.)
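
As an illustration only, a Shiny UI with these elements might look like the sketch below; the widget IDs and labels are assumptions, not the app's actual source.

    library(shiny)

    # Hypothetical UI sketch: a slider for the number of predictions,
    # a text input for the phrase, a candidate table and a timing display.
    ui <- fluidPage(
      titlePanel("Next.Word"),
      sidebarLayout(
        sidebarPanel(
          sliderInput("n_pred", "Number of predictions", min = 1, max = 10, value = 5),
          textInput("phrase", "Enter a phrase")
        ),
        mainPanel(
          tableOutput("candidates"),   # next-word candidates and scores
          textOutput("exec_time")      # execution time of the prediction
        )
      )
    )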

The About tab gives some details on the algorithms and how the app was implemented.

The References tab gives a list of useful natural language processing links.

How the App Was Created

The training set is derived from a corpus called HC Corpora, which was collected from publicly available sources (tweets, blogs and news) by a web crawler.

  1. We first read in the English blogs, tweets and news data.

  2. Then we split the combined data into training (60%), validation (20%) and test sets (20%).

  3. Next we create the corpus for the training set and use the tm package to clean it.

    a. We remove punctuation except apostrophes, remove emojis, convert to lowercase, remove numbers, remove swear words and strip whitespace.

    b. We did not remove stop words or stem the documents, since doing so would discard contextual information that we need to predict the next word.

  4. After this we tokenize the corpus to create 1-grams, 2-grams, 3-grams, 4-grams and 5-grams; these n-grams are used for word prediction. We only keep terms that appear more than 10 times. (A code sketch of the cleaning and tokenization follows this list.)
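
A minimal sketch of steps 3 and 4, assuming the tm and RWeka packages; train_text and profanity are placeholders for the training text and the swear-word list, and emoji removal is omitted for brevity:

    library(tm)
    library(RWeka)

    # Step 3: build and clean the training corpus.
    corpus <- VCorpus(VectorSource(train_text))        # train_text: character vector (assumed)
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_contractions = TRUE)
    corpus <- tm_map(corpus, removeWords, profanity)   # profanity: vector of swear words (assumed)
    corpus <- tm_map(corpus, stripWhitespace)

    # Step 4: tokenize into n-grams, shown here for trigrams.
    trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
    tdm <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))

    # Keep only terms that appear more than 10 times.
    freq <- slam::row_sums(tdm)
    trigrams <- freq[freq > 10]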

The Algorithm

We used the Stupid Backoff algorithm to rank next-word candidates: we “back off” to a lower-order n-gram when we have zero evidence for a higher-order n-gram. (A code sketch of the backoff follows the list below.)
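
For reference, the Stupid Backoff score (Brants et al., 2007) of a candidate word w given a context is count(context followed by w) / count(context) when the higher-order n-gram is observed; otherwise it is the score under the context shortened by one word, multiplied by a fixed penalty (0.4 in the original paper). The result is a relative score rather than a probability, since the values do not sum to 1.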

  • We first match the last 4 words of the input sentence with the first 4 words of the 5-grams.

  • If the count of the matches is less than 5, we match the last 3 words of the input sentence with the first 3 words of the 4-grams.

  • If the cumulative count of the matches is still less than 5, we repeat the above with 3-grams and then 2-grams.

  • Finally, if the cumulative count of the matches is still less than 5, we fill the remaining slots with the top 1-grams.
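
A simplified sketch of this backoff, assuming lookup tables ngrams[[k]] (data frames with columns prefix, word and score, sorted by descending score); the table layout and helper name are assumptions:

    # ngrams[[k]]: data frame with columns prefix (first k-1 words of each
    # k-gram), word (last word) and score, sorted by descending score (assumed).
    predict_next <- function(phrase, ngrams, n_top = 5) {
      words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
      candidates <- data.frame(word = character(), score = numeric())

      # Back off from 5-grams down to 2-grams until we have enough matches.
      for (k in 5:2) {
        context <- paste(tail(words, k - 1), collapse = " ")
        hits <- ngrams[[k]][ngrams[[k]]$prefix == context, c("word", "score")]
        candidates <- rbind(candidates, hits)
        if (length(unique(candidates$word)) >= n_top) break
      }

      # Fill any remaining slots with the top 1-grams.
      candidates <- rbind(candidates, ngrams[[1]][, c("word", "score")])
      candidates <- candidates[!duplicated(candidates$word), ]
      head(candidates, n_top)
    }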

Based on the phrase input by the user, we reactively find the next-word candidates (which could be an empty set) for the 5-grams, 4-grams, 3-grams, 2-grams and 1-grams, and merge the output. We then select the top 5 distinct words in the merged output to display as the next-word predictions.
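
On the server side, this could be wired up reactively along the lines below, assuming the predict_next() helper and the input IDs from the earlier sketches:

    server <- function(input, output) {
      # Recompute the candidates whenever the phrase or the slider changes.
      prediction <- reactive({
        t0 <- Sys.time()
        result <- predict_next(input$phrase, ngrams, n_top = input$n_pred)
        list(table = result, secs = as.numeric(difftime(Sys.time(), t0, units = "secs")))
      })

      output$candidates <- renderTable(prediction()$table)
      output$exec_time  <- renderText(sprintf("Prediction took %.3f s", prediction()$secs))
    }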