This model reserves some probability mass from the observed n-grams and redistributes it over unseen word combinations. If a combination of words is not found in the higher-order n-gram model, it backs off to lower-order n-grams until a non-zero count is found.
The probability is redistributed by first discounting the observed frequencies with the Good-Turing method. The count of an event that occurs \( a \) times is discounted by the ratio
\( \displaystyle d_a = \frac{(a+1)\,c_{a+1}}{a\,c_a} \)
where \( c_a \) is the number of distinct events observed exactly \( a \) times.
BUT, there is a limitation: the discount is undefined when \( c_a = 0 \), and it collapses to zero when \( c_{a+1} = 0 \), which happens more and more often as \( a \) increases and the counts-of-counts thin out.
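To make the formula concrete, here is a minimal Python sketch; the bigram counts are invented purely for illustration. It builds the counts-of-counts \( c_a \) from a table of n-gram counts, computes \( d_a \), and runs straight into the limitation just described, because \( c_{a+1} = 0 \) for the largest observed counts.

```python
from collections import Counter

# Toy bigram counts, invented purely for illustration.
ngram_counts = Counter({
    ("the", "cat"): 7, ("a", "dog"): 3,
    ("the", "dog"): 2, ("a", "cat"): 2,
    ("the", "fish"): 1, ("a", "fish"): 1, ("a", "bird"): 1,
    ("one", "fish"): 1, ("two", "fish"): 1,
    ("red", "fish"): 1, ("blue", "fish"): 1,
})

# c_a: how many distinct bigrams were observed exactly a times ("count of counts").
count_of_counts = Counter(ngram_counts.values())

def good_turing_discount(a):
    """Good-Turing discount ratio d_a = (a+1) * c_{a+1} / (a * c_a)."""
    if count_of_counts[a] == 0:
        raise ValueError(f"c_{a} = 0: the discount is undefined for count {a}")
    return (a + 1) * count_of_counts[a + 1] / (a * count_of_counts[a])

for a in sorted(count_of_counts):
    print(a, round(good_turing_discount(a), 3))
# 1 0.571
# 2 0.75
# 3 0.0   <- c_4 = 0, so the discount collapses to zero
# 7 0.0   <- the same problem for the largest count
```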
Katz BackOff Model Helps Us Out...
Katz proposed a solution to this Good-Turing problem: he defines a cut-off value \( k \) above which counts \( a > k \) are not discounted, treating these more frequently observed counts as reliable.
The recommended value is \( k = 5 \), so the model no longer fails for large \( a \).
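Continuing the sketch above (same invented counts and `good_turing_discount`), the cut-off is just a guard around the discount: counts above \( k = 5 \) are left as they are.

```python
K = 5  # Katz's recommended cut-off

def katz_discount(a):
    """Discount ratio with the Katz cut-off: counts above K are treated as reliable."""
    if a > K:
        return 1.0                      # frequent counts are not discounted
    return good_turing_discount(a)      # rare counts keep the Good-Turing discount

print(good_turing_discount(7))  # 0.0 -- Good-Turing alone fails for the large count
print(katz_discount(7))         # 1.0 -- the cut-off keeps the count as it is
```

Katz's full formulation also renormalizes the discounts for \( a \le k \) so that the total mass removed matches the Good-Turing estimate for unseen events; the sketch keeps only the simple cut-off rule described here.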
The model handles unseen events by adjusting the maximum-likelihood estimates of the observed n-grams, reserving probability mass for unseen word combinations.
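Putting the pieces together, the sketch below (still using the invented counts and `katz_discount` from above) shows a simplified, single-step back-off for bigrams: seen bigrams get discounted maximum-likelihood estimates, and the mass freed by the discounting is spread over the continuations never seen after a given history, here in proportion to raw unigram counts rather than a full lower-order Katz model.

```python
# Unigram counts of the continuation words, used as the lower-order model.
unigram_counts = Counter()
for (_, w2), c in ngram_counts.items():
    unigram_counts[w2] += c

def backoff_prob(word, history):
    """Simplified single-step Katz back-off for bigrams (sketch only)."""
    seen = {w: c for (h, w), c in ngram_counts.items() if h == history}
    total = sum(seen.values())
    if word in seen:
        # Seen bigram: discounted maximum-likelihood estimate.
        return katz_discount(seen[word]) * seen[word] / total
    # Probability mass reserved by discounting the seen continuations ...
    reserved = 1.0 - sum(katz_discount(c) * c for c in seen.values()) / total
    # ... is spread over the unseen continuations, in proportion to unigram counts.
    unseen_mass = sum(c for w, c in unigram_counts.items() if w not in seen)
    return reserved * unigram_counts[word] / unseen_mass

print(round(backoff_prob("fish", "the"), 3))  # 0.057: seen, discounted below the ML estimate 0.1
print(round(backoff_prob("bird", "the"), 3))  # 0.093: unseen, receives part of the reserved mass
```

For each history, the probabilities over the vocabulary still sum to one, because exactly the reserved mass is handed to the unseen continuations.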