Predictive Text

Greig Robertson
March 2018

Data Processing

At the core of the application is a data set of ngrams of several sizes, built from three source texts (Twitter, news feeds and blogs). The ngram data set was created by:

  • Reading in all text from the three sources into a single data set
  • Using the tidytext package to tokenise the source text into ngrams
  • Summarizing and ordering the resulting ngrams by frequency
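The three steps above can be sketched in a few lines. This is an illustrative Python version, not the author's tidytext code: the `tokenize` and `count_ngrams` helpers and the tiny `corpus` are assumptions made for the example, and the regular expression is a much cruder tokeniser than tidytext's.

```python
from collections import Counter
import re

def tokenize(text):
    # Lowercase the text and keep runs of letters and apostrophes.
    # A simplification; a real tokeniser handles far more cases.
    return re.findall(r"[a-z']+", text.lower())

def count_ngrams(lines, n):
    """Count every n-word sequence and return (ngram, count) pairs,
    ordered from most to least frequent."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common()

# Hypothetical stand-in for the combined Twitter, news and blog text.
corpus = [
    "One of the best days",
    "one of the worst days",
    "the end of the day",
]
trigrams = count_ngrams(corpus, 3)
```

Here `trigrams` starts with the pair for "one of the", the only 3-gram that occurs twice in the sample text.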

The 2-, 3- and 4-grams were created using the technique described above. The 5-gram data set was very large, so it was processed without the frequency ordering step (97% of 5-grams had a frequency of 1).

Data Processing - Optimizing

To ensure fast lookup of a predicted word from a phrase, the ngram data set consisted of two columns:

  • a phrase column with the words of the ngram excluding the last one
  • a predicted_word column containing the last word of the ngram

Below are sample phrases and word predictions from the ngram data set.

N-gram   Phrase                 Predicted Word
2        of                     the
3        one of                 the
4        the end of             the
5        your dreams live the   life
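A minimal sketch of how counted ngrams could be split into the phrase and predicted_word columns follows. It is written in Python for illustration and is not the author's code: `build_lookup` is a hypothetical helper, the counts are made up, and using a dictionary keyed on the phrase is an assumption about what "fast lookup" might look like.

```python
from collections import defaultdict

def build_lookup(ngram_counts):
    """Split each ngram into (phrase, predicted_word) and map every
    phrase to its candidate words, most frequent first."""
    candidates = defaultdict(list)
    for ngram, count in ngram_counts:
        # Everything before the last space is the phrase;
        # the last word is the prediction.
        phrase, _, predicted_word = ngram.rpartition(" ")
        candidates[phrase].append((count, predicted_word))
    return {
        phrase: [word for _, word in sorted(pairs, key=lambda p: -p[0])]
        for phrase, pairs in candidates.items()
    }

# Made-up counts, for illustration only.
ngram_counts = [
    ("of the", 40),
    ("one of the", 12),
    ("one of my", 3),
    ("the end of the", 5),
]
lookup = build_lookup(ngram_counts)
```

With this layout, `lookup["one of"]` gives the candidates for that phrase in frequency order, so the first entry is the top prediction and the rest can serve as alternatives.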

Prediction Algorithm

The algorithm works as follows:

  • a person would type in a phrase such as “one of”
  • a search through the ngram data set would be performed using the phrase
  • if the phrase was found (in the phrase column) the predicted_word would be returned

In this example, the word “the” would be returned, since the ngram data set (see the previous slide) contains the phrase “one of” with the predicted word “the”.

When a phrase is not found, the first word of the phrase is dropped and the search repeated. This continues until either a match is found or the search fails, in which case a random word is returned.
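The back-off search described above can be sketched as follows. This Python version is illustrative only: the `predict` function, the `vocabulary` list used for the random fallback, and the `max_words=4` cap (chosen because a 5-gram has a four-word phrase) are assumptions, not details taken from the application.

```python
import random

def predict(phrase, lookup, vocabulary, max_words=4):
    """Return the predicted next word for a phrase, dropping the
    first word and retrying whenever the phrase is not found."""
    # Keep only the last few words; longer phrases cannot match.
    words = phrase.lower().split()[-max_words:]
    while words:
        key = " ".join(words)
        if key in lookup:
            return lookup[key]
        words = words[1:]  # drop the first word and search again
    # No match at any length: fall back to a random word.
    return random.choice(vocabulary)

# Hypothetical lookup table and fallback vocabulary.
lookup = {"of": "the", "one of": "the", "the end of": "the"}
vocabulary = ["the", "a", "and"]

next_word = predict("she is one of", lookup, vocabulary)
```

For the input "she is one of", the search tries "she is one of", then "is one of", and matches on "one of".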

The application was tested by using testing and training samples from the source texts.

How To Use

To use the application:

  • The user navigates to the application
  • The application loads the ngram dataset (this can take about 10 seconds)
  • A default start word is suggested
  • The user types in their text and the application predicts the next word and also…
  • Shows a list of other predicted words
  • Shows a plot of counts per ngram data set and a count of the times the application failed to find a suitable next word.
  • For fun, the application generates a small chunk of text too.

Improvements could be made by trying to determine the subject or domain that the user is writing about and making the word predictions more relevant to that context.