Method used
In general, based on the last two words \((w_{i-2}, w_{i-1})\) of the input text, the program predicts the next word \(w^{\prime}_i\) by finding the 3-gram \((w_{i-2}, w_{i-1}, w^{\prime}_i)\) with the highest probability among all candidate words \(w^{\prime}_i\). To compare the probabilities of all possible 3-grams, the model must also handle 3-grams that never appear in the training dataset and assign them reasonable, nonzero probabilities. The method adopted in the app is to build the prediction model on the Katz backoff 3-gram algorithm, which in turn uses Good-Turing discounting to compute the probabilities needed by the backoff step.
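As an illustration of the lookup step only (not the app's actual code), the sketch below scores candidate words from raw trigram counts and backs off with a fixed weight of 0.4 ("stupid backoff" style) instead of the full Katz/Good-Turing machinery described below. The names `corpus`, `score`, and `predict_next` are illustrative.

```python
from collections import Counter

# Toy training corpus; the real app builds these tables from a large dataset.
corpus = "i am sam sam i am i do not like green eggs and ham".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))

def score(prev2, prev1, w):
    """Back off from trigram to bigram to unigram relative frequencies."""
    if trigram_counts[(prev2, prev1, w)] > 0:
        return trigram_counts[(prev2, prev1, w)] / bigram_counts[(prev2, prev1)]
    if bigram_counts[(prev1, w)] > 0:
        return 0.4 * bigram_counts[(prev1, w)] / unigram_counts[prev1]
    return 0.4 * 0.4 * unigram_counts[w] / sum(unigram_counts.values())

def predict_next(prev2, prev1):
    """Return the candidate word with the highest backed-off score."""
    return max(unigram_counts, key=lambda w: score(prev2, prev1, w))

print(predict_next("sam", "i"))  # -> 'am'
```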
Katz backoff model
In the Katz backoff N-gram model, if the count, and hence the probability, of the N-gram we need is zero, we back off the estimation to the (N-1)-gram, i.e. we approximate the probability of the N-gram using the (N-1)-gram probability. This requires first discounting some probability mass from the seen N-grams and then redistributing it to the unseen N-grams through the lower-order model. The discounting algorithm used here is Good-Turing discounting.
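For concreteness, the standard two-level Katz estimate for a 3-gram (following Jurafsky and Martin's notation; \(P^{*}\) is the discounted probability and \(\alpha\) the backoff weight, not symbols taken from the app itself) is

\[
P_{\mathrm{katz}}(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
P^{*}(w_i \mid w_{i-2}, w_{i-1}), & \text{if } C(w_{i-2}\,w_{i-1}\,w_i) > 0,\\
\alpha(w_{i-2}, w_{i-1})\,P_{\mathrm{katz}}(w_i \mid w_{i-1}), & \text{otherwise,}
\end{cases}
\]

where \(\alpha\) is chosen so that the backed-off probabilities of the unseen 3-grams exactly absorb the mass discounted from the seen ones.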
Good-Turing discounting
The intuition behind Good-Turing is to estimate the count of unseen N-grams from the count of N-grams that appear exactly once. More details can be found in Chapter 4 of Speech and Language Processing, 2nd ed., by Dan Jurafsky and James H. Martin.
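In outline: if \(N_c\) denotes the number of distinct N-grams observed exactly \(c\) times, Good-Turing replaces each raw count \(c\) with the adjusted count

\[
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_c},
\]

and the total probability mass reserved for unseen N-grams is \(N_1 / N\), where \(N\) is the total number of N-gram tokens observed. In other words, the mass given to unseen events equals the relative frequency of events seen exactly once.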