Marco Marchetti
November 12, 2017
Johns Hopkins University Capstone Pitch (Coursera)
The Predict Next Word Application predicts the three most probable next words based on the user's text input. The main goals of the application are fast performance and low memory allocation. This presentation summarizes the data processing, the prediction algorithm, and the application usage.
The predictive model is based on the HC Corpora data set, which includes files containing text extracted from blogs, news, and Twitter. The English data set contains over 100 million words, and a detailed analysis is available here.
The main objective of the data processing is to create contiguous sequences of words (n-grams) from the text. Several preprocessing steps are executed on the data before the n-gram tables are built.
To reduce memory allocation, only n-grams with a frequency greater than 1 are stored in the nGram Tables.
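As a minimal sketch of this counting-and-pruning step (Python is used here for illustration; build_ngram_table is a hypothetical helper, not the application's actual code):

```python
from collections import Counter

def build_ngram_table(tokens, n, min_count=2):
    """Count all n-grams in a token stream and keep only those seen
    at least min_count times (i.e., frequency greater than 1)."""
    counts = Counter(tuple(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    # Dropping singletons is what keeps the stored tables small.
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Example: a bigram table from a toy corpus.
tokens = "the cat sat on the mat and the cat slept".split()
bigram_table = build_ngram_table(tokens, 2)  # {('the', 'cat'): 2}
```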
To improve performance, some prediction-algorithm parameters are calculated during the nGram Table creation phase; other parameters are calculated on the fly while the user is typing the text.
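One parameter that lends itself to precomputation is the set of Good-Turing discount factors, since they depend only on the frequency-of-frequency counts of the tables. A sketch under that assumption, using the simple, un-renormalized form of the discount:

```python
from collections import Counter

def good_turing_discounts(ngram_table, k=5):
    """Precompute Good-Turing discounts d_r = r*/r for low counts
    r <= k, where r* = (r + 1) * N_{r+1} / N_r and N_r is the number
    of distinct n-grams observed exactly r times."""
    n_r = Counter(ngram_table.values())
    discounts = {}
    for r in range(1, k + 1):
        if n_r[r] > 0 and n_r[r + 1] > 0:
            r_star = (r + 1) * n_r[r + 1] / n_r[r]
            discounts[r] = r_star / r
    return discounts
```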
A simple Katz back-off with Good-Turing discounting is used to calculate the conditional probability of a word given its preceding words. If the n-gram has been seen more than k times in training, the conditional probability of a word given its history is proportional to the maximum likelihood estimate of that n-gram. Otherwise, the conditional probability is equal to the back-off conditional probability of the (n-1)-gram (Wikipedia: Katz's back-off model).
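In the standard formulation from the cited Wikipedia page, with C denoting training counts, d the Good-Turing discount, and alpha the back-off weight:

$$
P_{katz}(w_i \mid w_{i-n+1} \dots w_{i-1}) =
\begin{cases}
d\,\dfrac{C(w_{i-n+1} \dots w_i)}{C(w_{i-n+1} \dots w_{i-1})} & \text{if } C(w_{i-n+1} \dots w_i) > k \\[1ex]
\alpha\,P_{katz}(w_i \mid w_{i-n+2} \dots w_{i-1}) & \text{otherwise}
\end{cases}
$$

A recursive sketch of the same rule over the pruned tables (tables, discounts, and alphas are assumed precomputed as above; all names are illustrative, not the application's actual code):

```python
def katz_prob(ngram, tables, discounts, alphas, k=0):
    """Katz back-off probability of ngram[-1] given ngram[:-1].
    tables[n] maps n-tuples to counts; discounts holds Good-Turing
    factors d_r; alphas holds back-off weights per history."""
    n = len(ngram)
    if n == 1:  # base case: plain unigram relative frequency
        return tables[1].get(ngram, 0) / sum(tables[1].values())
    count = tables[n].get(ngram, 0)
    hist_count = tables[n - 1].get(ngram[:-1], 0)
    if count > k and hist_count > 0:
        # Seen often enough: discounted maximum likelihood estimate.
        return discounts.get(count, 1.0) * count / hist_count
    # Otherwise back off to the (n-1)-gram, scaled by alpha.
    return alphas.get(ngram[:-1], 1.0) * katz_prob(ngram[1:], tables,
                                                   discounts, alphas, k)
```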
An experimental variant of the Katz back-off algorithm has also been tested. The main idea is to treat a pair of words as a single token in the prediction algorithm; the prediction is then based on what we call the nBiGram model ("bi-bigram", "tri-bigram", and so on).
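One reading of this idea (the exact pairing scheme is an assumption) is to fuse adjacent word pairs into composite tokens and run the same back-off machinery over them:

```python
def to_bigram_tokens(tokens):
    """Fuse adjacent word pairs into single composite tokens, so an
    n-gram over these tokens is an "n-bigram" over the original words."""
    return ["_".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

# "the cat sat on" -> ["the_cat", "cat_sat", "sat_on"]
```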
References:
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 400-401.
After the nGram Tables loading phase, the user can type text into the text box and the application displays the three top predicted words in the SwiftKey prediction-bar simulation area. The central word is the most probable one, and the two words on its left and right are the less probable alternatives.
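A minimal sketch of that layout rule (assuming the second-best word goes on the left and the third-best on the right, which the slide does not specify):

```python
def prediction_bar(ranked_words):
    """Arrange the top three predictions as [left, center, right],
    with the most probable word in the center slot."""
    top = ranked_words[:3]
    return [top[1], top[0], top[2]] if len(top) == 3 else top
```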
The prediction algorithm runs in under 8 ms, allowing an "instant" version of the application. Usage of the application is quite simple, with three modes.
The Application is available here.