A Shiny App to predict the next word in a text chain

Tom Withey
02/01/19

Synopsis

I have developed an app which predicts the next word given an input chain of at least two words. This presentation includes:

  • A description of the data underlying the app, and of how that data was prepared
  • A description of the algorithm and how it works
  • A short tutorial on how to use the app

Data preparation

The algorithm relies on the quanteda package to extract a 'corpus' of text data. The data used has been downloaded from the link here. This data is prepared by taking the following steps:

  • Take a 5% sample of the text data (to reduce computational run times)
  • Using the quanteda package, extract all ngrams (i.e. chains of sequential words) between one and six words long (i.e. all 1-grams, 2-grams, 3-grams, etc., up to 6-grams)
  • Calculate the frequency with which each word chain occurs in the sampled text data, as a table for each of the word chain lengths
  • Store the resulting six tables (known as ngram frequency tables) as csv files on my local drive
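
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the app's actual preparation script: the toy sentences, the `sample_frac` variable, the `ngram_table()` helper, the `ngram`/`freq` column names and the `ngram_N.csv` file names are all assumptions made for the example. Only `tokens()`, `tokens_ngrams()` and `dfm()` come from quanteda.

```r
library(quanteda)

# Toy text standing in for the downloaded data (the real data is not reproduced here)
lines <- c("the cat sat on the mat",
           "the cat sat on the sofa",
           "a dog sat on the mat")

# The slides use a 5% sample (0.05); 1 keeps this tiny example whole
sample_frac <- 1
set.seed(42)
sampled <- sample(lines, size = ceiling(sample_frac * length(lines)))

# Tokenise once, then build one frequency table per ngram length
toks <- tokens(tolower(sampled), remove_punct = TRUE)

# Hypothetical helper: count every n-word chain and sort by frequency
ngram_table <- function(toks, n) {
  freq <- colSums(dfm(tokens_ngrams(toks, n = n, concatenator = " ")))
  freq <- sort(freq, decreasing = TRUE)
  data.frame(ngram = names(freq), freq = as.integer(freq),
             stringsAsFactors = FALSE)
}

tables <- lapply(1:6, ngram_table, toks = toks)

# Store the six tables as csv files (a temporary folder is used here)
for (n in 1:6) {
  write.csv(tables[[n]],
            file.path(tempdir(), sprintf("ngram_%d.csv", n)),
            row.names = FALSE)
}
```

Tokenising once and reusing the tokens for every ngram length avoids repeating the most expensive step six times.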

Before running the algorithm, the six ngram frequency tables are read from the local drive as data tables, using the data.table package, and held in memory. Then, based on a given input word chain, the algorithm does the following (see next slide):
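
As an illustration of the loading step, the sketch below reads csv files into a list of data tables with `fread()`. The folder, the `ngram_N.csv` file names and the `ngram`/`freq` columns are assumptions, since the slides do not give the real ones. Two tiny stand-in files are written first so that the example runs on its own.

```r
library(data.table)

# Hypothetical folder; the app reads from a local drive instead
ngram_dir <- tempdir()

# Write two tiny stand-in tables so the example is self-contained
fwrite(data.table(ngram = c("the", "on"), freq = c(5L, 3L)),
       file.path(ngram_dir, "ngram_1.csv"))
fwrite(data.table(ngram = c("sat on", "on the"), freq = c(3L, 3L)),
       file.path(ngram_dir, "ngram_2.csv"))

# Read each file once and keep the tables in memory as a list
paths <- file.path(ngram_dir, sprintf("ngram_%d.csv", 1:2))
ngram_tables <- lapply(paths, fread, sep = ",")

# Look up the frequency of one word chain
ngram_tables[[2]][ngram == "sat on", freq]
```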

Algorithm

  • If the input word chain is longer than five words, the algorithm extracts just the last five words of the chain.
  • Using the 6-gram frequency table, the algorithm looks for any occurrences of the input chain within the first five words of each 6-gram. If one or more matches are found, the sixth word of the most frequent match is returned as the predicted word.
  • If no match is found, the algorithm takes just the last four words of the input chain and repeats the principle above with the 5-gram table, and so on until a match is found.
  • If no match is found for the last three words, the algorithm takes just the last two words and uses the Katz Back-Off (KBO) model to return the predicted word. An excellent description of this model can be found here (credit: Michael Szczepaniak). The KBO model also deals with word chains which are missing from the source data.
  • If the input word chain is shorter than five words, the algorithm jumps to the relevant step; for example, if the input is a three-word chain, it starts by looking through the 4-gram table.
  • If the input word chain is only one word long, the algorithm will return an error.
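
The longest-match-first logic above can be sketched in base R as follows. This is an illustration under stated assumptions, not the app's code: the table layout (`prefix`, `word`, `freq`), the toy tables and the function names `lookup()`, `fallback_two_words()` and `predict_next()` are all hypothetical. The Katz Back-Off step is not implemented here. A plain most-frequent lookup in the 3-gram table stands in for it, so unlike KBO it applies no discounting and cannot handle unseen word chains.

```r
# Toy tables keyed by ngram length; the prefix/word/freq layout is assumed
toy_tables <- list(
  `3` = data.frame(prefix = c("sat on", "sat on"), word = c("the", "a"),
                   freq = c(5, 1), stringsAsFactors = FALSE),
  `4` = data.frame(prefix = "cat sat on", word = "my",
                   freq = 2, stringsAsFactors = FALSE)
)

# Return the most frequent word following `prefix`, or NA if there is no match
lookup <- function(tab, prefix) {
  if (is.null(tab)) return(NA_character_)
  hits <- tab[tab$prefix == prefix, , drop = FALSE]
  if (nrow(hits) == 0) return(NA_character_)
  hits$word[which.max(hits$freq)]
}

# Stand-in for the KBO step: NOT Katz back-off, just a 3-gram lookup
fallback_two_words <- function(last_two, tables) {
  lookup(tables[["3"]], paste(last_two, collapse = " "))
}

predict_next <- function(input, tables, max_n = 6) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  if (length(words) < 2) stop("Please enter at least two words.")
  # Keep at most the last five words
  words <- tail(words, max_n - 1)
  # Try the longest chain first, shortening by one word after each miss
  k <- length(words)
  while (k >= 3) {
    guess <- lookup(tables[[as.character(k + 1)]],
                    paste(tail(words, k), collapse = " "))
    if (!is.na(guess)) return(guess)
    k <- k - 1
  }
  # Only two words left: hand over to the (stand-in) two-word step
  fallback_two_words(tail(words, 2), tables)
}

predict_next("the cat sat on", toy_tables)  # matched in the 4-gram table
predict_next("a dog sat on", toy_tables)    # falls through to the two-word step
```

Starting the loop at the length of the input means that shorter inputs automatically begin at the relevant table, without a separate branch for each input length.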

The Shiny app

  • The Shiny app contains a sidebar panel with an input text box. Enter the text for which the next word is to be predicted there.
  • Then press Enter on your keyboard or click the “submit” button.
  • The predicted word appears in the main panel.
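
A layout like the one described above could be put together along these lines. This is a minimal sketch, not the app's source: the input and output ids, the labels and the `predict_next_word()` stand-in are hypothetical, and `submitButton()` is just one way of providing a "submit" button, since the slides do not say which control the app uses.

```r
library(shiny)

# Hypothetical stand-in; the real app would call its prediction algorithm here
predict_next_word <- function(text) "example"

ui <- fluidPage(
  titlePanel("Next word prediction"),
  sidebarLayout(
    # Sidebar panel with the input text box and the submit button
    sidebarPanel(
      textInput("chain", "Enter at least two words:"),
      submitButton("Submit")
    ),
    # Main panel showing the predicted word
    mainPanel(textOutput("prediction"))
  )
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$chain)  # show nothing until some text has been entered
    predict_next_word(input$chain)
  })
}

# Build the app object; launch it interactively with runApp(app)
app <- shinyApp(ui, server)
```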