ACGII
Tue Dec 20 16:52:28 2016
Often when entering text on a mobile phone or other handheld device, you are prompted with a likely next word. That is the gist of this application: the user supplies a partial phrase or sentence, and the application provides its best guess for the next word.
The SwiftKey data set used to build this application consists of samples taken from three sources: news articles, blogs and Twitter messages. The data set is huge and must be rendered into a form that is compact and quick to search.
The application uses this data to predict the next word in a phrase. The prediction relies on a history of previously seen phrase fragments, known as Ngrams. The application searches the data for fragments that match the input and determines the next word based on statistical likelihood.
The different types of Ngrams used are listed below, with examples borrowed from Dr. Seuss ("I do not like green eggs and ham."):
* Unigrams - single words found in text (I)
* Bigrams - two contiguous words found in text (I do)
* Trigrams - three contiguous words found in text (I do not)
* Quadragrams - four contiguous words found in text (I do not like)
* Pentagrams - five contiguous words found in text (I do not like green)
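The decomposition in the list above can be sketched in a few lines of Python. This is only an illustration: the function name `ngrams` and the whitespace-based word splitting are assumptions, not the application's actual code.

```python
def ngrams(text, n):
    """Return every run of n contiguous words in text, in order of appearance."""
    words = text.split()  # assumption: words are separated by whitespace
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "I do not like green eggs and ham"

# Print the first Ngram of each size, from unigram (1) to pentagram (5).
for n in range(1, 6):
    print(n, ngrams(sentence, n)[0])
```

A text shorter than `n` words simply yields an empty list, so no special case is needed for short input.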
Each type of Ngram is stored in its own table along with its frequency of occurrence. In other words, there are five separate tables, one each for unigrams, bigrams, trigrams, quadragrams and pentagrams. Each table contains two columns, name and frequency.
As an example, two simple three-word sentences, "I am Sam." and "Sam I am.", are decomposed into unigrams, bigrams and trigrams. Because each sentence has only three words, there are no quadragrams or pentagrams.
* Unigram(frequency) - I(2), am(2), Sam(2)
* Bigram(frequency) - I am(2), am Sam(1), Sam I(1)
* Trigram(frequency) - I am Sam(1), Sam I am(1)
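The counts above can be reproduced with a short Python sketch. It assumes the two source sentences are "I am Sam" and "Sam I am" (the only pair consistent with the frequencies listed) and that Ngrams do not cross sentence boundaries. The names `build_tables` and `ngrams` are hypothetical, and a `Counter` per Ngram size stands in for the two-column name/frequency tables.

```python
from collections import Counter


def ngrams(words, n):
    """Return every run of n contiguous words from a list of words."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_tables(sentences, max_n=5):
    """Build one frequency table per Ngram size, keyed by size (1..max_n)."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        words = sentence.split()
        # Count each sentence separately so Ngrams never span two sentences.
        for n in tables:
            tables[n].update(ngrams(words, n))
    return tables


tables = build_tables(["I am Sam", "Sam I am"])
for n in (1, 2, 3):
    print(n, dict(tables[n]))
```

The quadragram and pentagram tables stay empty here, since neither sentence is long enough to contain one.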
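When a sentence fragment is input, it is broken down into quadragrams and the pentagram table is searched for entries that begin with the last quadragram, that is, the last four words of the input. This is a Markov-style assumption: the next word is taken to depend only on the few words immediately before it. The algorithm selects the match with the highest frequency.
If no matches are found, the fragment is broken into trigrams and the quadragram table is searched. If there are still no matches, the process is repeated with bigrams (searching the trigram table) and then with unigrams (searching the bigram table).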
The word match with the highest frequency is used as the most likely candidate.
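The back-off search described above might look roughly like the following Python sketch. It is not the application's code: the function name `predict_next`, the lower-casing of input, the dictionary-of-tables layout and the tiny `toy_tables` counts are all assumptions made up for illustration.

```python
from collections import Counter


def predict_next(fragment, tables, max_n=5):
    """Guess the next word by backing off from longer to shorter contexts.

    tables maps an Ngram size to a {name: frequency} table.
    Returns None when no table contains a match.
    """
    words = fragment.lower().split()  # assumption: tables hold lower-case text
    # Start with the longest usable context (at most max_n - 1 words).
    for k in range(min(len(words), max_n - 1), 0, -1):
        context = " ".join(words[-k:]) + " "
        table = tables.get(k + 1, {})
        # Keep entries of size k + 1 whose first k words equal the context.
        matches = {name: freq for name, freq in table.items()
                   if name.startswith(context)}
        if matches:
            best = max(matches, key=matches.get)  # highest frequency wins
            return best.split()[-1]
    return None


# Hypothetical counts, invented for this example only.
toy_tables = {
    2: Counter({"merry christmas": 3, "happy birthday": 5, "happy new": 4}),
    3: Counter({"happy new year": 4}),
}

print(predict_next("Merry", toy_tables))
print(predict_next("Happy", toy_tables))
print(predict_next("Happy New", toy_tables))
```

A linear scan with `startswith` is fine for a toy table. For a data set the size of SwiftKey's, an index keyed on the context words would be a more practical choice.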
* Input: Merry; Next word: christmas
* Input: Happy; Next word: birthday
* Input: Happy New; Next word: year
Try it yourself at: