R Shiny application to predict the next word using natural language processing


Data Science Capstone


Fabien Tarrade

Quantitative Analyst - Data Scientist - Researcher

Introduction and goal

Introduction

  • The main goal of this capstone project is to build an interactive Shiny R application that can predict the next word following a phrase of input text.
  • We used a Natural Language Processing N-gram model together with the “Stupid Back-off” algorithm (a simplified variant of Katz Back-off) in the final implementation
  • We used a data-set from a corpus called HC Corpora, in particular 3 data-sets in English (blogs, news and twitter) containing respectively 0.9 million, 1 million and 2.4 million lines of text, to train our model
  • The Shiny R web application is available here: Application

Constraints

For this application we need to consider the memory footprint and the CPU performance, so that the model computes its prediction within a suitable delay, since the free Shiny hosting plan has some strong restrictions. This constrains the size of the training data-set and the choice of model.

Description of the algorithm

The next word prediction model is based on the “Stupid Back-off” algorithm, given that this model is well suited to web-scale data and works well in practice (More details).
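
For reference, the score this family of models computes is the Stupid Back-off score of Brants et al. (2007); the back-off factor alpha = 0.4 below is the value recommended in that paper, not a number taken from this project:

    S(w_i \mid w_{i-k+1}^{i-1}) =
      \begin{cases}
        f(w_{i-k+1}^{i}) \,/\, f(w_{i-k+1}^{i-1}) & \text{if } f(w_{i-k+1}^{i}) > 0 \\
        \alpha \; S(w_i \mid w_{i-k+2}^{i-1})     & \text{otherwise, with } \alpha = 0.4
      \end{cases}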

  • the data-set was cleaned: converted to lowercase, with extra white space and all special characters removed
  • the data-set was tokenized into sorted N-grams (1- to 4-grams) with cumulative frequencies
  • low-frequency N-grams were further filtered out to reduce the table sizes for optimal performance
  • the application loads the 4 data frames containing the N-grams (saved as compressed R files)
  • the same cleaning techniques are applied to the phrase of input text given by the user
  • the last three, two or one word(s) of the input text given by the user are extracted
  • if 4-grams with the 3 last entered words as prefix are found, the algorithm returns the suffix (last word) of the 3 most frequent matching 4-grams as predicted words
  • if no 4-gram is matched, back off to 3-grams and match against the 2 last entered words
  • if no 3-gram is matched, back off to 2-grams and match against the last entered word
  • finally, if no match is found in the 2-grams, use the most frequent words from the 1-grams (see the R sketches after this list)
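
To make the pipeline concrete, below is a minimal sketch of the cleaning and N-gram counting steps in R. The function and file names (clean_text, ngram_freq, "ngram1.rds" to "ngram4.rds") are illustrative assumptions, not the project's actual code:

    # Sketch only: cleaning and N-gram counting (names are assumptions).
    clean_text <- function(x) {
      x <- tolower(x)                  # all in lowercase
      x <- gsub("[^a-z' ]", " ", x)    # drop special characters
      x <- gsub("\\s+", " ", x)        # collapse white space
      trimws(x)
    }

    ngram_freq <- function(lines, n) {
      words <- strsplit(clean_text(lines), " ", fixed = TRUE)
      grams <- unlist(lapply(words, function(w) {
        if (length(w) < n) return(character(0))
        # slide a window of n words over each line
        sapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "))
      }))
      freq <- sort(table(grams), decreasing = TRUE)
      df <- data.frame(ngram = names(freq), count = as.integer(freq),
                       stringsAsFactors = FALSE)
      df[df$count > 1, ]               # filter low-frequency N-grams
    }

    # lines <- readLines("en_US.blogs.txt")   # plus news and twitter
    # for (n in 1:4) saveRDS(ngram_freq(lines, n), sprintf("ngram%d.rds", n))

And a sketch of the back-off lookup itself, assuming one data frame per N-gram order, sorted by decreasing frequency as built above:

    # Sketch only: back-off prediction over the saved N-gram tables.
    ngrams <- lapply(1:4, function(n) readRDS(sprintf("ngram%d.rds", n)))

    predict_next <- function(phrase, k = 3) {
      words <- strsplit(clean_text(phrase), " ", fixed = TRUE)[[1]]
      for (n in 4:2) {                 # try 4-grams, then 3-, then 2-grams
        if (length(words) < n - 1) next
        prefix <- paste(tail(words, n - 1), collapse = " ")
        tab <- ngrams[[n]]
        hits <- tab[startsWith(tab$ngram, paste0(prefix, " ")), ]
        if (nrow(hits) > 0) {
          # return the last word of the k most frequent matches
          return(list(words = sub(".* ", "", head(hits$ngram, k)),
                      used = sprintf("%d-gram", n)))
        }
      }
      # final back-off: the most frequent unigrams
      list(words = head(ngrams[[1]]$ngram, k), used = "1-gram")
    }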

Closing remarks

Our implementation of the “Stupid Back-off” algorithm achieves an accuracy of ~20%, compared to SwiftKey with an accuracy of >30% (we couldn't find any official numbers). Removing stop words and using stemmed words didn't help. The novel aspect was to optimize the code so that it runs on the entire data-set quickly, using parallelized, vectorized functions (a sketch is given below).
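
As an illustration of this kind of parallelized, vectorized processing (reusing the ngram_freq sketch above; the chunking scheme and core count are assumptions):

    library(parallel)

    # Sketch only: count N-grams over corpus chunks in parallel; the heavy
    # lifting inside each chunk stays vectorized (gsub, strsplit, table).
    # mclapply forks, so this assumes a Unix-alike system.
    count_parallel <- function(lines, n, cores = max(1, detectCores() - 1)) {
      chunk_id <- (seq_along(lines) - 1) %/% ceiling(length(lines) / cores)
      tables   <- mclapply(split(lines, chunk_id), ngram_freq, n = n,
                           mc.cores = cores)
      merged   <- aggregate(count ~ ngram, data = do.call(rbind, tables),
                            FUN = sum)
      merged[order(-merged$count), ]   # re-sort by decreasing frequency
    }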

Some possible improvements:

  • improve the accuracy by using a bigger training set (trillions of words are available on the web)
  • use a wider variety of sources, since style differs by genre and source
  • correct mistakes, typos and shortened words in the input for a better prediction
  • we didn't consider punctuation in the prediction, but this could be added
  • add smoothing for rare or unseen N-grams (Good-Turing, Kneser-Ney, Witten-Bell)
  • use neural-network language models, although these are more computationally complex and require more memory

The references for this application can be found under “More/References”.

Instruction for the Shiny R application

Below we give the instructions and describe how the application functions:

  1. Go to the Application and enter a sequence of words in the text box
  2. Press the “Next Word” button. The predicted next word is displayed together with the original sentence
  3. A note is also displayed indicating which specific N-gram was used for the next word prediction (a minimal UI sketch follows below)
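
For illustration, a minimal sketch of a Shiny UI and server matching this interaction; the widget ids and the predict_next() helper (from the sketch above) are assumptions, not the application's actual source:

    library(shiny)

    # Sketch only: UI/server mirroring the interaction described above.
    ui <- fluidPage(
      textInput("phrase", "Enter a sequence of words:"),
      actionButton("go", "Next Word"),
      textOutput("prediction"),
      textOutput("note")               # which N-gram was used
    )

    server <- function(input, output) {
      result <- eventReactive(input$go, {
        list(phrase = input$phrase, pred = predict_next(input$phrase))
      })
      output$prediction <- renderText(
        paste(result()$phrase, result()$pred$words[1]))
      output$note <- renderText(
        paste("Prediction based on a", result()$pred$used, "match"))
    }

    shinyApp(ui, server)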

Below is an example of the results:

This tool is offered under the standard Beerware license