We propose to use statistics derived from three large bodies of text data to build a statistical model. Given a few words of text as input, we will use this model as a reference to predict the next word in the sequence.
Given the variety of word sequences in natural language, predicting the next word in a sentence from only the first three or four words may seem an impossible task. However, by analyzing observed word sequences and measuring the frequency of single words, word pairs, and word triplets, we believe we can, for example, discover the most likely fourth word given the first three words of a sentence.
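To make the idea concrete, here is a minimal sketch in R. The counts below are purely hypothetical (they are not taken from our corpora): given the last two words of some input, we look up the most frequent triplet beginning with that pair and offer its third word as the prediction.

# Hypothetical triplet counts, for illustration only.
triplet_counts <- data.frame(
  w1    = c("for", "for", "for"),
  w2    = c("the", "the", "first"),
  w3    = c("follow", "ride", "time"),
  count = c(27, 9, 41),
  stringsAsFactors = FALSE
)

# Given a two-word prefix, return the most frequently observed third word.
most_likely_third <- function(counts, w1, w2) {
  hits <- counts[counts$w1 == w1 & counts$w2 == w2, ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$w3[which.max(hits$count)]
}

most_likely_third(triplet_counts, "for", "the")  # returns "follow"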
Here is an example plot of word frequencies in the news data: almost 1,600 words occur only once, just over 400 words appear exactly twice, and progressively fewer words appear as we look toward higher frequencies.
Our plan comprises a few steps:
1. Iterate through all the data to identify and count instances of each word, word pair, and word triplet in the three corpora. Three models will be built, one for each corpus. When predicting the next word, we will try to match the input text against each model to see which seems to fit best (a sketch of this counting step appears after these steps).
2. Create a Shiny app that takes user input and looks it up to find the best prediction. Here we will use a very simple back-off strategy: take the last two words of the input and look for a matching triplet; if a match is found, offer the triplet's third word as the prediction. If no match is found, try to match the last word of the input against the pairs; if a match is found, offer the pair's second word. If neither lookup succeeds, offer the most likely (most frequent) single word from the data as the prediction (see the second sketch after these steps).
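A rough sketch of the counting in step 1, assuming each corpus has already been read into a character vector of lines (one document or sentence per element). The tokenization here is deliberately simple (lowercase, keep letters and apostrophes); the real cleaning rules may differ.

count_ngrams <- function(lines, n) {
  # crude tokenization: lowercase, strip everything but letters/apostrophes, split on whitespace
  cleaned <- tolower(gsub("[^A-Za-z' ]", " ", lines))
  tokens  <- strsplit(cleaned, "\\s+")
  ngrams <- unlist(lapply(tokens, function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  # frequency table of n-grams, most frequent first
  sort(table(ngrams), decreasing = TRUE)
}

# One model per corpus: frequency tables of single words, pairs, and triplets.
# `news_lines` is a placeholder name for one of the three corpora.
# news_model <- list(
#   ones   = count_ngrams(news_lines, 1),
#   twos   = count_ngrams(news_lines, 2),
#   threes = count_ngrams(news_lines, 3)
# )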
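And a minimal sketch of the back-off lookup in step 2, assuming a model built with count_ngrams() above (ones, twos, and threes are frequency tables whose names are the n-grams themselves). In the Shiny app this function would sit behind a textInput and a textOutput; the UI wiring is omitted here.

predict_word <- function(model, input) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]

  # 1. Try the last two words of the input against the triplets.
  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    hits <- model$threes[startsWith(names(model$threes), paste0(prefix, " "))]
    if (length(hits) > 0) {
      return(tail(strsplit(names(hits)[1], " ")[[1]], 1))  # third word of best triplet
    }
  }

  # 2. Fall back to the last word of the input against the pairs.
  if (length(words) >= 1) {
    prefix <- tail(words, 1)
    hits <- model$twos[startsWith(names(model$twos), paste0(prefix, " "))]
    if (length(hits) > 0) {
      return(tail(strsplit(names(hits)[1], " ")[[1]], 1))  # second word of best pair
    }
  }

  # 3. Last resort: the single most frequent word in the corpus.
  names(model$ones)[1]
}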