Sam Tomioka
September 19, 2018
The Next Word Prediction web application accepts text input from users and suggests the next word based on the preceding words.
The prediction language model was built from approximately 5,000 lines randomly sampled from over 3 million lines of text collected from blogs, news articles, and Twitter.
Note: in order to increase the speed of prediction, a smaller model (5,000 lines) was built for this web application, which significantly decreases the model's accuracy.
The text data were tokenized into 1- to 5-grams, and the maximum likelihood estimate (MLE) was calculated for each n-gram. Since unseen n-grams have no MLE, a discounting method is applied: some of the probability mass is taken from observed n-grams and redistributed to unobserved n-grams, so that unseen n-grams receive non-zero probability estimates.
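As a rough illustration of this counting step, the sketch below tallies 1- to 5-grams from a few toy lines and computes a bigram MLE. The function name, the whitespace tokenizer, and the toy data are illustrative assumptions, not the application's actual code.

```python
from collections import Counter

def ngram_counts(lines, n_max=5):
    """Tokenize each line and count every 1- to n_max-gram (illustrative sketch)."""
    counts = {n: Counter() for n in range(1, n_max + 1)}
    for line in lines:
        tokens = line.lower().split()  # the real app likely does fuller cleaning
        for n in range(1, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts(["the cat sat on the mat", "the cat ran"])

# MLE of a bigram: count of the bigram divided by the count of its context word.
print(counts[2][("the", "cat")] / counts[1][("the",)])  # 2/3
```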
The amount of missing probability mass is
\[ \alpha(w_{n-1}) = 1 - \sum\limits_{w\in\mathcal{A}(w_{n-1})} \frac{c^*(w_{n-1},w)}{c(w_{n-1})} \]
and the missing probability mass for a bigram context \( (w_{n-2},w_{n-1}) \), which covers unobserved trigrams, is
\[ \alpha(w_{n-2},w_{n-1}) = 1 - \sum\limits_{w\in\mathcal{A}(w_{n-2},w_{n-1})} \frac{c^*(w_{n-2},w_{n-1},w)}{c(w_{n-2},w_{n-1})} \]
Let \( q_{ML}(w_n) \) denote the unigram maximum likelihood estimate and \( q_{BO}(w_n\:|\:w_{n-1}) \) the backed-off bigram probability. The discounted mass \( \alpha(w_{n-1}) \) is distributed over unknown bigrams \( q_{BO}(w_n\:|\:w_{n-1}) \), and \( \alpha(w_{n-2},w_{n-1}) \) over unknown trigrams \( q_{BO}(w_n\:|\:w_{n-2},w_{n-1}) \). For the bigram case:
\[ q_{BO}(w_n\:|\:w_{n-1}) = \alpha(w_{n-1})\frac{q_{ML}(w_n)}{\sum\limits_{w\in\mathcal{B}(w_{n-1})}q_{ML}(w)} = \alpha(w_{n-1})\frac{c(w_n)}{\sum\limits_{w\in\mathcal{B}(w_{n-1})}c(w)} \]
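To make the formulas concrete, here is a small numeric sketch that computes \( \alpha(w_{n-1}) \) and a backed-off bigram probability from toy counts. It assumes an absolute discount \( c^*(\cdot) = c(\cdot) - d \) with \( d = 0.5 \); the post does not state which discounting scheme the app actually uses, so both the discount and the counts are illustrative.

```python
from collections import Counter

d = 0.5  # assumed absolute discount; the actual value is not given in the post
unigrams = Counter({"the": 3, "cat": 2, "sat": 1, "ran": 1})
bigrams = Counter({("the", "cat"): 2, ("cat", "sat"): 1, ("cat", "ran"): 1})

def alpha(w1):
    """Missing mass: 1 - sum over observed w of c*(w1, w) / c(w1)."""
    kept = sum(c - d for (a, _), c in bigrams.items() if a == w1)
    return 1.0 - kept / unigrams[w1]

def q_bo(w2, w1):
    """Backed-off probability of an unseen bigram (w1, w2): alpha(w1) is
    shared among the words w in B(w1), those with c(w1, w) = 0, in
    proportion to their unigram counts."""
    unseen = [w for w in unigrams if (w1, w) not in bigrams]
    return alpha(w1) * unigrams[w2] / sum(unigrams[w] for w in unseen)

print(alpha("the"))        # 0.5: only "the cat" is observed, discounted by d
print(q_bo("sat", "the"))  # 0.1 = 0.5 * c(sat) / (c(the) + c(sat) + c(ran))
```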
In essence, the prediction model first looks for a match in the 5-gram model and selects the word with the highest probability. If no match is found, the model continues searching in the next lower-order model until a match is identified.
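A sketch of that backoff search, assuming the model is stored as one probability table per n-gram order (the table layout and names here are assumptions for illustration, not the app's actual data structures):

```python
def predict_next(tokens, tables):
    """tables[n] maps (context_tuple, word) -> probability for each order n."""
    for n in range(5, 0, -1):  # start with the 5-gram model and back off
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        candidates = {w: p for (ctx, w), p in tables[n].items() if ctx == context}
        if candidates:  # a match at this order: take the most probable word
            return max(candidates, key=candidates.get)
    return None  # no match at any order

# Usage: with only trigram entries present, the search falls through to n = 3.
tables = {n: {} for n in range(1, 6)}
tables[3][(("i", "love"), "you")] = 0.4
tables[3][(("i", "love"), "it")] = 0.3
print(predict_next(["i", "love"], tables))  # -> "you"
```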
Validation
The prediction model was validated against random samples from the text data used in model building, and this step was repeated until the model was optimized for its purpose; the main adjustment was the amount of sample used for each language model. For each line of sample text, the last word was stripped off and used as the class vector. The model was then run on each line to predict the stripped word, and accuracy was measured against the class vector.
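The validation loop might look like the sketch below: strip the last word of each sampled line, ask a prediction function for its guess, and report the fraction of correct guesses. The `predict` argument stands in for any next-word predictor, such as the `predict_next` sketch above; the actual validation code may differ.

```python
def accuracy(sample_lines, predict):
    """Fraction of lines whose stripped last word is predicted correctly."""
    hits = total = 0
    for line in sample_lines:
        tokens = line.lower().split()
        if len(tokens) < 2:
            continue  # need at least one context word plus a target
        target = tokens[-1]  # the stripped last word serves as the class vector
        hits += int(predict(tokens[:-1]) == target)
        total += 1
    return hits / total if total else 0.0
```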
You will see “Enter some sentence” in the main panel on the right, which indicates that the app is ready to suggest the next word as you type your sentence.
1) Wait for “Enter some sentence”
Keyboard application for mobile devices
This model should help users write sentences faster
Search engines
This model could be used to increase the relevance of search results, since the predicted next word may carry more specific information