Predictive text using N-Gram model

pmohan
2017-06-20

Background

  • This application uses a 3-gram model to predict text, trained on the provided corpus of news, blog and Twitter data. It is trained only on the en_US locale.
  • The model uses a probabilistic method (the maximum likelihood estimate) to predict the third word of a trigram given the previous two words.
  • Since it is always possible to encounter word sequences not found in the training set, the maximum likelihood estimate assigns them a probability of zero, which in turn makes the probability of otherwise valid sentences zero (illustrated in the Python sketch below). To prevent this, a technique called smoothing is applied.
  • I have used a popular smoothing technique called Kneser-Ney smoothing in this application. A brief description of this method is provided in my milestone report.
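  • As an illustration only (this is not the code used in the deployed application), a minimal Python sketch of the maximum likelihood estimate for trigrams and of the zero-probability problem that smoothing addresses:

        from collections import Counter

        def train_trigram_mle(tokens):
            """Count trigrams and their bigram contexts from a list of tokens."""
            trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
            bigrams = Counter(zip(tokens, tokens[1:]))
            return trigrams, bigrams

        def mle_prob(w1, w2, w3, trigrams, bigrams):
            """MLE: P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2).
            Any unseen trigram gets probability zero, which is what smoothing fixes."""
            context_count = bigrams.get((w1, w2), 0)
            return trigrams.get((w1, w2, w3), 0) / context_count if context_count else 0.0

        tokens = "the cat sat on the mat and the cat sat on the rug".split()
        trigrams, bigrams = train_trigram_mle(tokens)
        print(mle_prob("the", "cat", "sat", trigrams, bigrams))   # 1.0
        print(mle_prob("the", "cat", "ran", trigrams, bigrams))   # 0.0, hence the need for smoothing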

Performance of a language model

  • In our case, let us define the entropy H(p) as the minimum number of bits needed to represent a real language with probability distribution p. For a language model m, which approximates the real language p, we can estimate the cross-entropy H(p, m).

  • Cross-entropy H(p, m) is an upper bound on the entropy H(p). That is, it has been proved that: \[ H(p) \leq H(p, m) \]

  • This means that a simplified language model m (for example, a 3-gram model) can be used to approximate a complex language p. The more accurate m is, the closer its cross-entropy will be to the true entropy H(p).

  • Perplexity, a measure of the accuracy of a model, is the exponential of the cross-entropy (since entropy is computed on a log scale); the standard formulas are given below.
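  • As a concrete reference, these are the standard definitions (not specific to this application), estimated over a test sequence of N words: \[ H(p, m) \approx -\frac{1}{N} \log_2 m(w_1 w_2 \ldots w_N), \qquad PP = 2^{H(p, m)} \]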

Performance measurement

  • A better N-gram model is one that assigns a higher probability to the test data. Perplexity has an inverse relationship with this probability, since it is the exponential of the negative average log probability of the test data (a small sketch of the calculation follows below).

  • Using this application's 3-gram model with Kneser-Ney smoothing, I calculated a perplexity of 53 on my test set. Tests have shown that unigram models give a perplexity of nearly 1000, while trigrams without smoothing come in at around 100 (lower is better).
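  • The calculation itself is simple; below is a minimal Python sketch, where prob_fn is a placeholder for the smoothed trigram probability P(w3 | w1, w2) rather than the application's actual model object:

        import math

        def perplexity(test_tokens, prob_fn):
            """Perplexity of a token sequence under a trigram model.

            Lower is better: it means the model assigned a higher
            probability to the test data."""
            total_log_prob = 0.0
            n = 0
            for w1, w2, w3 in zip(test_tokens, test_tokens[1:], test_tokens[2:]):
                total_log_prob += math.log2(prob_fn(w1, w2, w3))
                n += 1
            return 2 ** (-total_log_prob / n)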

Model implementation

  • To create the application, the trained model had to be optimized. Trigrams seen in the training data with a probability of less than 0.001 were removed from the model. Even this reduced model uses about 5 GB of RAM when loaded into memory.

  • To overcome this memory limitation when running on a device such as a phone, the application is designed with a client-server model. The typed text is sent to the server, which tokenizes the sentence and predicts the next word from the model using the last bigram in the text (a sketch of this lookup is given below).
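  • A minimal Python sketch of the server-side lookup (the model structure and function names here are hypothetical; the deployed application implements this differently):

        import re

        # Hypothetical model: maps a (w1, w2) bigram context to a list of
        # (next_word, probability) pairs sorted by descending probability.
        def predict_next(text, model, top_k=5):
            """Tokenize the typed text and predict the next word from its last bigram."""
            tokens = re.findall(r"[a-z']+", text.lower())
            if len(tokens) < 2:
                return []
            candidates = model.get((tokens[-2], tokens[-1]), [])
            return [word for word, prob in candidates[:top_k]]

        model = {("of", "the"): [("day", 0.12), ("year", 0.08), ("world", 0.05)]}
        print(predict_next("Have a nice rest of the", model))   # ['day', 'year', 'world']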

Running the application

  • The predicted text will only be as good as the training data. N-gram models cannot predict a style of text that is wildly different from the text they were trained on (for example, Shakespeare vs. the Wall Street Journal).

  • The app starts predicting text after the first two words are typed into the text box. If a prediction is found, a menu is displayed with the potential matches in order of probability. As the user continues typing, the list is narrowed down automatically; alternatively, they can type an altogether new word if it is not in the list.

  • Thank you for reading this presentation. The application is available at shinyapps.io. Please be patient with the initial menu load time.