R Shiny application to predict the next word using natural language processing


Data Science Capstone


Fabien Tarrade

Quantitative Analyst - Data Scientist - Researcher

Introduction and goal

Introduction

  • The main goal of this capstone project is to build an interactive Shiny R application that can predict the next word following a phrase of input text.
  • We used a Natural Language Processing N-gram model together with the “Stupid Back-off” algorithm (a simplified variant of Katz Back-off) in the final implementation
  • We used a data-set from a corpus called HC Corpora, in particular 3 data-sets in English (blogs, news and twitter) containing respectively 0.9 million, 1 million and 2.4 million lines of text, to train our model
  • The Shiny R web application is available here: Application

Constraints

For this application we need to consider the memory footprint and the CPU performance, so that the model computes its prediction within a suitable delay, since the free Shiny hosting plan has some strong restrictions. This constrains the size of the training data-set and the choice of model.

Description of the algorithm

The next word prediction model is based on the “Stupid Back-off” algorithm, given that this model is well suited to web-scale data and works well in practice (More details).
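
For reference, the score this family of models computes is the Stupid Back-off score of Brants et al. (2007); the back-off factor alpha = 0.4 below is the value recommended in that paper, not a number taken from this project:

    S(w_i \mid w_{i-k+1}^{i-1}) =
      \begin{cases}
        f(w_{i-k+1}^{i}) \,/\, f(w_{i-k+1}^{i-1}) & \text{if } f(w_{i-k+1}^{i}) > 0 \\
        \alpha \; S(w_i \mid w_{i-k+2}^{i-1})     & \text{otherwise, with } \alpha = 0.4
      \end{cases}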

  • the data-set was cleaned: converted to lowercase, with extra white space and all special characters removed
  • the data-set was tokenized into sorted N-grams (1- to 4-grams) with cumulative frequencies
  • low-frequency N-grams were further filtered out to reduce the table sizes for optimal performance
  • the application loads the 4 data frames containing the N-grams (saved as compressed R files)
  • the same cleaning techniques are applied to the phrase of input text given by the user
  • the last three, two or one word(s) of the input text given by the user are extracted
  • if 4-grams with the 3 last entered words as prefix are found, the algorithm returns the suffix (last word) of the 3 most frequent matching 4-grams as predicted words
  • if no 4-gram is matched, back off to 3-grams and match against the 2 last entered words
  • if no 3-gram is matched, back off to 2-grams and match against the last entered word
  • finally, if no match is found in the 2-grams, use the most frequent words from the 1-grams (see the R sketches after this list)
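
To make the pipeline concrete, below is a minimal sketch of the cleaning and N-gram counting steps in R. The function and file names (clean_text, ngram_freq, "ngram1.rds" to "ngram4.rds") are illustrative assumptions, not the project's actual code:

    # Sketch only: cleaning and N-gram counting (names are assumptions).
    clean_text <- function(x) {
      x <- tolower(x)                  # all in lowercase
      x <- gsub("[^a-z' ]", " ", x)    # drop special characters
      x <- gsub("\\s+", " ", x)        # collapse white space
      trimws(x)
    }

    ngram_freq <- function(lines, n) {
      words <- strsplit(clean_text(lines), " ", fixed = TRUE)
      grams <- unlist(lapply(words, function(w) {
        if (length(w) < n) return(character(0))
        # slide a window of n words over each line
        sapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "))
      }))
      freq <- sort(table(grams), decreasing = TRUE)
      df <- data.frame(ngram = names(freq), count = as.integer(freq),
                       stringsAsFactors = FALSE)
      df[df$count > 1, ]               # filter low-frequency N-grams
    }

    # lines <- readLines("en_US.blogs.txt")   # plus news and twitter
    # for (n in 1:4) saveRDS(ngram_freq(lines, n), sprintf("ngram%d.rds", n))

And a sketch of the back-off lookup itself, assuming one data frame per N-gram order, sorted by decreasing frequency as built above:

    # Sketch only: back-off prediction over the saved N-gram tables.
    ngrams <- lapply(1:4, function(n) readRDS(sprintf("ngram%d.rds", n)))

    predict_next <- function(phrase, k = 3) {
      words <- strsplit(clean_text(phrase), " ", fixed = TRUE)[[1]]
      for (n in 4:2) {                 # try 4-grams, then 3-, then 2-grams
        if (length(words) < n - 1) next
        prefix <- paste(tail(words, n - 1), collapse = " ")
        tab <- ngrams[[n]]
        hits <- tab[startsWith(tab$ngram, paste0(prefix, " ")), ]
        if (nrow(hits) > 0) {
          # return the last word of the k most frequent matches
          return(list(words = sub(".* ", "", head(hits$ngram, k)),
                      used = sprintf("%d-gram", n)))
        }
      }
      # final back-off: the most frequent unigrams
      list(words = head(ngrams[[1]]$ngram, k), used = "1-gram")
    }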

Closing remarks

Our implementation of the “Stupid Back-off” algorithm achieves an accuracy of ~20%, compared to SwiftKey with an accuracy of >30% (we couldn't find any official numbers). Removing stop words and using stemmed words didn't help. The novel aspect was to optimize the code so that it runs on the entire data-set quickly, using parallelized, vectorized functions (a sketch is given below).
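
As an illustration of this kind of parallelized, vectorized processing (reusing the ngram_freq sketch above; the chunking scheme and core count are assumptions):

    library(parallel)

    # Sketch only: count N-grams over corpus chunks in parallel; the heavy
    # lifting inside each chunk stays vectorized (gsub, strsplit, table).
    # mclapply forks, so this assumes a Unix-alike system.
    count_parallel <- function(lines, n, cores = max(1, detectCores() - 1)) {
      chunk_id <- (seq_along(lines) - 1) %/% ceiling(length(lines) / cores)
      tables   <- mclapply(split(lines, chunk_id), ngram_freq, n = n,
                           mc.cores = cores)
      merged   <- aggregate(count ~ ngram, data = do.call(rbind, tables),
                            FUN = sum)
      merged[order(-merged$count), ]   # re-sort by decreasing frequency
    }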

Some possible improvements:

  • improve the accuracy by using a bigger training set (trillions of words are available on the web)
  • use a wider variety of sources, since style differs by genre and source
  • correct mistakes, typos and shortened words in the input for a better prediction
  • we didn't consider punctuation in the prediction, but this could be added
  • add smoothing for rare or unseen N-grams (Good-Turing, Kneser-Ney, Witten-Bell)
  • use neural-network language models, although these are more computationally complex and require more memory

The references for this application can be found under “More/References”.

Instruction for the Shiny R application

Below we give the instructions and describe how the application functions:

  1. Go to the Application and enter a sequence of words in the text box
  2. Press the “Next Word” button. The predicted next word is displayed together with the original sentence
  3. A note is also displayed indicating which specific N-gram was used for the next word prediction (a minimal UI sketch follows below)
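
For illustration, a minimal sketch of a Shiny UI and server matching this interaction; the widget ids and the predict_next() helper (from the sketch above) are assumptions, not the application's actual source:

    library(shiny)

    # Sketch only: UI/server mirroring the interaction described above.
    ui <- fluidPage(
      textInput("phrase", "Enter a sequence of words:"),
      actionButton("go", "Next Word"),
      textOutput("prediction"),
      textOutput("note")               # which N-gram was used
    )

    server <- function(input, output) {
      result <- eventReactive(input$go, {
        list(phrase = input$phrase, pred = predict_next(input$phrase))
      })
      output$prediction <- renderText(
        paste(result()$phrase, result()$pred$words[1]))
      output$note <- renderText(
        paste("Prediction based on a", result()$pred$used, "match"))
    }

    shinyApp(ui, server)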

Below is an example of the results:

This tool is offered under the standard Beerware license