Greig Robertson
March 2018
At the core of the application is a data set containing different levels of ngrams built from three source texts (Twitter, News feeds and Blogs). The ngram data set was created by tokenizing the source texts into ngrams, counting them and ordering them by frequency. 2-, 3- and 4-grams were created using this technique. Creating 5-grams resulted in a very large data set, so these were processed without the frequency-ordering step (97% of 5-grams had a frequency of 1).
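As an illustration, below is a minimal Python sketch of this kind of ngram counting with frequency ordering. The function name, toy corpus and output format are assumptions for illustration; the source does not show the actual processing code.

```python
from collections import Counter

def build_ngrams(sentences, n):
    """Count n-grams across tokenized sentences, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    # Frequency ordering (the step skipped for 5-grams).
    return counts.most_common()

# Toy corpus standing in for the Twitter, News and Blog texts.
corpus = ["one of the best days", "at the end of the road"]
print(build_ngrams(corpus, 2)[:3])
```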
To ensure fast lookup of a predicted word based on a phrase, the ngram data set consisted of two columns:

- a phrase column with the words of the ngram excluding the last one
- a predicted_word column containing the last word of the ngram

Below are sample phrases and word predictions from the ngram data set, followed by a sketch of this structure in code.
| N-gram | Phrase | Predicted Word |
|---|---|---|
| 2 | of | the |
| 3 | one of | the |
| 4 | the end of | the |
| 5 | your dreams live the | life |
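To make the lookup structure concrete, here is a small Python sketch of how the two columns could be represented as a dictionary keyed on the phrase. The variable names and frequency values are assumptions for illustration, not the application's actual data.

```python
# Frequency-ordered (tokens, count) pairs, as produced when the
# ngrams were counted; the counts here are illustrative.
sorted_ngrams = [
    (("one", "of", "the"), 42),
    (("one", "of", "a"), 7),
    (("the", "end", "of"), 31),
]

lookup = {}
for tokens, _count in sorted_ngrams:
    phrase = " ".join(tokens[:-1])   # every word except the last
    # The input is frequency-ordered, so the first prediction seen
    # for a phrase is the most frequent one; keep only that.
    lookup.setdefault(phrase, tokens[-1])

print(lookup)  # {'one of': 'the', 'the end': 'of'}
```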
The algorithm works as follows:
- The phrase typed by the user is searched for in the ngram data set (matching on the phrase column); if a match is found, the predicted_word is returned. For example, the word “the” would be returned for the phrase “one of”, since the table above lists “one of” with a predicted word of “the”.
- When a phrase is not found, the first word of the phrase is dropped and the search repeated. This continues until either a match is found or the search fails, in which case a random word is returned; this backoff search is sketched below.
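Below is a minimal Python sketch of this backoff search, assuming one phrase-to-word table per ngram level. The predict function, its arguments and the sample tables are illustrative, not the application's actual implementation.

```python
import random

def predict(lookups, phrase, vocabulary):
    """Predict the next word for `phrase` by backing off through
    the ngram levels: drop the leading word until a match is found."""
    words = phrase.lower().split()
    while words:
        table = lookups.get(len(words))   # phrase -> predicted_word
        if table:
            prediction = table.get(" ".join(words))
            if prediction:
                return prediction
        words = words[1:]   # no match at this level: drop the first word
    return random.choice(vocabulary)   # search failed at every level

# Tables keyed by phrase length, mirroring the sample table above.
lookups = {1: {"of": "the"}, 2: {"one of": "the"}, 3: {"the end of": "the"}}
print(predict(lookups, "every one of", ["the", "a", "and"]))   # -> "the"
```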
The application was tested using separate training and test samples drawn from the source texts.
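One way such a test could be scored is sketched below, reusing the hypothetical predict function from the previous sketch. Exact-match accuracy over held-out sentences is an assumption; the source does not state which metric was used.

```python
def accuracy(lookups, test_sentences, vocabulary):
    """Fraction of held-out next words that `predict` gets exactly right."""
    hits = trials = 0
    for sentence in test_sentences:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens)):
            context = " ".join(tokens[max(0, i - 4):i])  # at most 4 words
            if predict(lookups, context, vocabulary) == tokens[i]:
                hits += 1
            trials += 1
    return hits / trials if trials else 0.0
```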
To use the application:
Improvements could be made by trying to determine the subject or domain that the user is writing about and making the word predictions more relevant to that context.