Data Science Capstone: Next Word Prediction

Mario R. Melchiori
22/04/2016

Data Selection and Cleaning

Before building the word prediction algorithm, the following steps were carried out:

  • A subset of the original data was randomly selected from the three sources and merged into one. Because of limited computational power, only 10% of the original data was sampled.
  • Data cleaning involved converting to lower case and removing punctuation, numbers, non-printable characters, and profanity.
  • Four sets of n-grams (quadrigrams, trigrams, bigrams, and unigrams) were then created.
  • After calculating their cumulative frequencies, each word was mapped to an integer so that it can be looked up quickly later. This encoding also reduces the amount of RAM the files occupy.
  • Finally, the four n-gram objects were saved as Rds files.
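
The preparation steps above can be sketched roughly as follows. The original work was done in R (the n-grams were saved as Rds files); Python is used here only for illustration, and this is not the author's code. The function names, the tiny `BAD_WORDS` set, the fixed seed, and the frequency-ranked integer ids are all assumptions made for the example.

```python
import random
import re
from collections import Counter

# Hypothetical stand-in for the profanity list used in the original work.
BAD_WORDS = {"darn"}


def sample_lines(lines, fraction=0.10, seed=42):
    """Randomly keep a fraction of the lines (10% by default)."""
    rng = random.Random(seed)
    k = max(1, round(len(lines) * fraction))
    return rng.sample(lines, k)


def clean(text):
    """Lower-case, drop everything except letters, and remove bad words."""
    text = text.lower()
    # Punctuation, digits, and non-printable characters become spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [w for w in text.split() if w not in BAD_WORDS]


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def build_vocab(tokens):
    """Map each distinct word to an integer id, most frequent word first."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered, start=1)}


def encode(ngram_counts, vocab):
    """Replace the words of each n-gram by their integer ids."""
    return {tuple(vocab[w] for w in gram): c for gram, c in ngram_counts.items()}


tokens = clean("The cat sat on the mat. The cat ate 2 fish, darn!")
vocab = build_vocab(tokens)
bigrams = encode(count_ngrams(tokens, 2), vocab)
print(bigrams[(1, 2)])  # (1, 2) encodes ("the", "cat"), which occurs twice
```

Storing integer tuples instead of repeated strings is what makes the tables smaller and the lookups cheaper. Note that this simple regular expression also splits contractions such as "don't" into two tokens; a real pipeline would need to decide how to handle them.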

Description of the Algorithm Used

Katz-Backoff

  • First, we look for a quadrigram whose first three words match the last three words of the user's sentence, the sentence for which we are trying to predict the next word;
  • If no quadrigram is found, we back off to the trigrams (first two words of the trigram match the last two words of the sentence);
  • If no trigram is found, we back off to the bigrams (first word of the bigram matches the last word of the sentence); and
  • If no bigram is found, we back off to the unigrams, so that when no n-gram matches at all the algorithm predicts the most frequent tokens (the, to, and a).
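
The back-off lookup described above can be sketched as follows. This is a minimal illustration in Python, not the author's R implementation: `build_model` and `predict` are hypothetical names, and returning candidates only from the first n-gram order that matches is an assumption. The sketch covers only the lookup order given in the list; a full Katz back-off model additionally discounts the observed counts and applies back-off weights, which is not shown here.

```python
from collections import Counter, defaultdict


def build_model(tokens, max_n=4):
    """For every context of 0 to max_n - 1 words, count the words that follow."""
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, next_word = tokens[i:i + n]
            model[tuple(context)][next_word] += 1
    return model


def predict(model, sentence, k=3, max_n=4):
    """Try the longest context first and back off to shorter ones."""
    words = sentence.lower().split()
    for size in range(max_n - 1, -1, -1):
        if size > len(words):
            continue  # the sentence is too short for this n-gram order
        context = tuple(words[len(words) - size:])
        followers = model.get(context)
        if followers:
            # Highest count first; ties are broken alphabetically.
            ranked = sorted(followers.items(), key=lambda kv: (-kv[1], kv[0]))
            return [word for word, _ in ranked[:k]]
    return []


corpus = "the cat sat on the mat the cat ate the fish".split()
model = build_model(corpus)

# No quadrigram starts with "saw the cat", so the trigram context "the cat" is used.
print(predict(model, "my dog saw the cat"))  # ['ate', 'sat']

# Nothing matches "hello world", so the most frequent unigrams are returned.
print(predict(model, "hello world"))  # ['the', 'cat', 'ate']
```

The empty context `()` holds the unigram counts, so the final fall-back to the most common words needs no special case.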

Shiny Application

A Shiny application was developed based on the next word prediction model described previously. The main features of the application, which is available online, are:

  • The user enters a sequence of words in the text box.
  • As the user types, up to four of the most likely next words are displayed.
  • The cleaned version of the user's sentence and the most likely next word are displayed in the app.

Accuracy

In Natural Language Processing, more training data generally means better estimates, so it helps to train on as much data as possible. Given that only 10% of the data from the three text files was sampled in this case, the application performs reasonably well according to Jan Hagelauer's benchmark.R program (https://github.com/jan-san/dsci-benchmark):

  • Overall top-3 score: 15.04 %
  • Overall top-1 precision: 11.75 %
  • Overall top-3 precision: 17.95 %
  • Average runtime: 30.06 msec
  • Number of predictions: 28464
  • Total memory used: 522.58 MB
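
For readers unfamiliar with the precision figures above, the sketch below shows one common way to compute top-k precision: the share of test cases in which the true next word appears among the first k predictions. This is not the code of benchmark.R, whose exact scoring rules may differ, and the function name and toy data are invented for illustration.

```python
def top_k_precision(predictions, targets, k):
    """Fraction of cases where the true word is among the first k predictions."""
    hits = sum(1 for preds, target in zip(predictions, targets) if target in preds[:k])
    return hits / len(targets)


# Toy data: ranked predictions for four test cases and the true next words.
predictions = [
    ["the", "a", "to"],
    ["cat", "dog", "fish"],
    ["on", "in", "at"],
    ["sat", "ate", "ran"],
]
targets = ["the", "fish", "by", "ate"]

print(top_k_precision(predictions, targets, k=1))  # 0.25
print(top_k_precision(predictions, targets, k=3))  # 0.75
```

By construction, top-3 precision can never be lower than top-1 precision, which matches the ordering of the two figures reported above.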