A.Casares M
June 14th 2017
A simulation of the process of predicting the next word entered on a mobile device, based on the previously typed words, using Machine Learning techniques applied to Natural Language Processing.
Self-explanatory, documented working screen, with examples included. Two possible environments.
Three main objectives:
Ad-hoc light and fast data structures:
The system sets up a window frame four words wide and moves it from the leftmost position, one word at a time, towards the right, keeping it aligned with the last input word (typed or chosen from the suggestions).
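A minimal sketch of such a sliding window, written here in Python with illustrative names (the project's own structures may differ):

    from collections import deque

    class InputWindow:
        """Keeps only the last `width` words typed or chosen from the suggestions."""
        def __init__(self, width=4):
            # older words fall off the left edge as new ones arrive on the right
            self.words = deque(maxlen=width)

        def push(self, word):
            """Slide the window one word to the right, aligned with the last input word."""
            self.words.append(word.lower())

        def prefix(self):
            """Current window content, used as the prefix for the next prediction."""
            return tuple(self.words)

    # Example: feeding the input one word at a time
    window = InputWindow(width=4)
    for w in "my fellow americans".split():
        window.push(w)
    print(window.prefix())   # ('my', 'fellow', 'americans')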
Taken from a famous speech, this example shows one instance of prediction:
The full prefix my fellow americans is used in the Quadgrams model. The search finds one quadgram having it as a root: “my fellow americans i”, with frequency 1.
That makes model 4 the base model; the probabilities of the other models are projected onto it using alpha and beta, the probability projection factors, which become progressively smaller for each new model.
Using fellow americans as a root, the search does not yield any valid prediction in Trigrams.
Using americans as the root, 99 bigrams are found. The four with the highest frequencies correspond to: “americans are”, “americans who”, “americans have”, “americans and”. Their probabilities are computed using the alpha and beta values, which make them comparable with the probability found in the base model.
The valid candidates are then sorted by decreasing probability, and a final list is obtained. In this case, the preferred option is are (coming from a bigram), then I (coming from the quadgram), and the rest from bigrams.
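A rough Python sketch of the ranking just described. The table layout, the alpha values and the counts are illustrative assumptions, and the simple relative frequencies below omit the Good-Turing discounting that, in the real model, lets a frequent bigram candidate outrank a frequency-1 quadgram as in this example:

    def predict(prefix, ngram_tables, alphas, top=4):
        """Collect candidates from the base model and the shorter models,
        project their probabilities with the per-model factors, and sort them."""
        candidates = {}
        max_order = len(prefix) + 1                       # 3-word prefix -> quadgrams
        for order in range(max_order, 1, -1):             # quadgrams, trigrams, bigrams
            root = prefix[-(order - 1):]                  # root used in this model
            counts = ngram_tables.get(order, {}).get(root, {})
            total = sum(counts.values())
            best = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top]
            for word, count in best:
                prob = alphas[order] * count / total      # projected probability
                candidates[word] = max(prob, candidates.get(word, 0.0))
        # final list, sorted by decreasing probability
        return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)

    # Toy data mirroring the example (counts and factors are made up)
    tables = {
        4: {('my', 'fellow', 'americans'): {'i': 1}},
        3: {('fellow', 'americans'): {}},                 # no valid trigram prediction
        2: {('americans',): {'are': 120, 'who': 95, 'have': 80, 'and': 60}},
    }
    alphas = {4: 1.0, 3: 0.4, 2: 0.16}                    # progressively smaller per model
    print(predict(('my', 'fellow', 'americans'), tables, alphas))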
Katz's backoff smoothing and the Good-Turing estimators not only assign non-zero probabilities to out-of-vocabulary words (OOVs), but also make it possible to compare probabilities across different models.
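For reference, the standard textbook form of these estimators (the exact constants used in the project may differ) can be written in LaTeX as:

    % Good-Turing adjusted count: N_c is the number of n-grams seen exactly c times
    c^{*} = (c + 1)\,\frac{N_{c+1}}{N_{c}}

    % Katz backoff, shown for bigrams; higher orders back off recursively the same way
    P_{katz}(w_i \mid w_{i-1}) =
      \begin{cases}
        d_{c}\,\dfrac{c(w_{i-1} w_i)}{c(w_{i-1})} & \text{if } c(w_{i-1} w_i) > 0,\\[4pt]
        \alpha(w_{i-1})\,P_{katz}(w_i)            & \text{otherwise,}
      \end{cases}

where d_c = c*/c is the Good-Turing discount and alpha(w_{i-1}) redistributes the left-over probability mass to the words never seen after w_{i-1}.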
Although the highest probabilities frequently come from the base model, this is not always the case, as this example shows.