Corpus-Based Word Prediction

We propose to use statistics derived from three large bodies of text to build a statistical model. Given a few words of text as input, we will use this model to predict the next word in the sequence.

The Problem

Given the variety of word sequences in natural language, predicting the next word in a sentence from only the first three or four words may seem impossible. However, by analyzing observed word sequences and measuring the frequencies of single words, word pairs, and word triplets, we believe we can, for example, discover the most likely fourth word given the first three words of a sentence.
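
To make the counting idea concrete, below is a minimal sketch in base R (chosen because the plan targets a Shiny app) that counts single words and word triplets in a toy corpus. The helper name make_ngrams and the toy sentences are our own illustration, not part of the proposed system.

    # Toy corpus: three short sentences
    toy_lines <- c("i want to go home",
                   "i want to eat now",
                   "i want a break")

    # Split each line into words and slide a window of length n over it
    make_ngrams <- function(lines, n) {
      unlist(lapply(strsplit(tolower(lines), "\\s+"), function(w) {
        if (length(w) < n) return(character(0))
        sapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "))
      }))
    }

    sort(table(make_ngrams(toy_lines, 1)), decreasing = TRUE)  # single words
    sort(table(make_ngrams(toy_lines, 3)), decreasing = TRUE)  # word triplets

    # "i want to" appears twice, so "to" is the best guess for the word
    # that follows "i want" in this toy corpus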

The Data

For this proof of concept, we have been given access to three corpora of text:

  • 37 million words in context from 900,000 lines of web blog content.
  • 34 million words in context from 1,000,000 lines of news stories.
  • 30 million words, emojis, and symbols in context from 2.36 million Twitter messages.

For example, the plot below shows that almost 1,600 words occur only once in the news data, while a little over 400 words occur exactly twice, and the number of distinct words falls off steadily at higher frequencies.

News data histogram of single term frequencies
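
The frequency-of-frequency counts behind a plot like this can be derived directly from the raw text. Below is a rough sketch of one way to do it in R; the file name en_US.news.txt and the simple letters-only tokenization are assumptions for illustration.

    # Read the news corpus (assumed file name) and split it into lowercase words
    news_lines <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
    words <- unlist(strsplit(tolower(news_lines), "[^a-z']+"))
    words <- words[nzchar(words)]

    word_freq    <- table(words)      # how often each distinct word appears
    freq_of_freq <- table(word_freq)  # how many words appear 1, 2, 3, ... times

    # Plot the low end of the distribution: words seen up to 50 times
    barplot(head(freq_of_freq, 50),
            xlab = "Occurrences of a word",
            ylab = "Number of distinct words",
            main = "News data: single term frequencies")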

The Plan

Our plan comprises a few steps:

  1. Iterate through all the data to identify and count instances of each word, word pair, and word triplet in the three corpora. Three models will be built, one for each corpus. When predicting the next word, we will match the input text against each model to see which fits best.

  2. Prune each model so that only the most likely candidates remain. For example, many triplets begin with the words “I want”. Since “I want to” is the most frequent triplet starting with “I want”, it will be kept; all other triplets starting with “I want” will be removed from the model (see the sketch after this list).
  3. Build a Shiny app that takes user input and looks it up to find the best prediction. Here we will use a very simple back-off strategy: take the last two words of the input and look for a match among the triplets; if one is found, offer the triplet’s third word as the prediction. If not, look for the last word of the input among the pairs; if a match is found, offer the pair’s second word. If there is still no match, offer the most frequent single word in the data as the prediction. A minimal app skeleton is also sketched after this list.
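
To illustrate steps 2 and 3, the sketch below shows one way the pruning and the back-off lookup might be written in base R. The names prune_ngrams and predict_next are placeholders, and the counts are assumed to be named integer vectors such as those produced by table() in the earlier sketch.

    # Step 2: for each prefix, keep only the most frequent completion
    prune_ngrams <- function(counts) {
      grams  <- names(counts)
      prefix <- sub(" \\S+$", "", grams)       # everything but the last word
      last   <- sub("^.* ", "", grams)         # the last word
      ord    <- order(counts, decreasing = TRUE)
      keep   <- ord[!duplicated(prefix[ord])]  # highest-count entry per prefix
      setNames(last[keep], prefix[keep])       # lookup table: prefix -> next word
    }

    # Step 3: simple back-off prediction
    predict_next <- function(input, tri_lookup, bi_lookup, top_unigram) {
      w <- strsplit(tolower(input), "\\s+")[[1]]
      if (length(w) >= 2) {                    # try the last two words in the triplets
        hit <- tri_lookup[paste(tail(w, 2), collapse = " ")]
        if (!is.na(hit)) return(unname(hit))
      }
      if (length(w) >= 1) {                    # back off to the last word in the pairs
        hit <- bi_lookup[tail(w, 1)]
        if (!is.na(hit)) return(unname(hit))
      }
      top_unigram                              # back off to the most frequent word
    }

With lookup tables built from the toy counts above, predict_next("I want", ...) would return "to", since "i want to" is the most frequent triplet beginning with "i want".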
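
Finally, a bare-bones Shiny wrapper around that lookup could look roughly like the following; the object names carry over from the sketch above, and the interface is deliberately minimal.

    library(shiny)

    ui <- fluidPage(
      textInput("phrase", "Type a few words:"),
      textOutput("prediction")
    )

    server <- function(input, output) {
      output$prediction <- renderText({
        # tri_lookup, bi_lookup and top_unigram are assumed to be loaded already
        predict_next(input$phrase, tri_lookup, bi_lookup, top_unigram)
      })
    }

    shinyApp(ui, server)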