This model reserves some probability mass from the observed n-grams and redistributes it over unseen word combinations. If a combination of words is not found in the higher-order n-gram model, it backs off to lower-order n-grams until a non-zero count is found.
The probability is redistributed by first discounting the observed frequencies with the Good-Turing method. The count of an event that occurs \( a \) times is discounted by the ratio
\( \displaystyle d_a = \frac{(a+1)\,c_{a+1}}{a\,c_a} \)
where \( c_a \) is the number of distinct events observed exactly \( a \) times.
BUT, there is a limitation: the discount is undefined when \( c_a = 0 \), and it collapses to zero when \( c_{a+1} = 0 \), which happens more and more often as \( a \) increases and the counts-of-counts thin out.
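To make the formula concrete, here is a minimal Python sketch; the bigram counts are invented purely for illustration. It builds the counts-of-counts \( c_a \) from a table of n-gram counts, computes \( d_a \), and runs straight into the limitation just described, because \( c_{a+1} = 0 \) for the largest observed counts.

```python
from collections import Counter

# Toy bigram counts, invented purely for illustration.
ngram_counts = Counter({
    ("the", "cat"): 7, ("a", "dog"): 3,
    ("the", "dog"): 2, ("a", "cat"): 2,
    ("the", "fish"): 1, ("a", "fish"): 1, ("a", "bird"): 1,
    ("one", "fish"): 1, ("two", "fish"): 1,
    ("red", "fish"): 1, ("blue", "fish"): 1,
})

# c_a: how many distinct bigrams were observed exactly a times ("count of counts").
count_of_counts = Counter(ngram_counts.values())

def good_turing_discount(a):
    """Good-Turing discount ratio d_a = (a+1) * c_{a+1} / (a * c_a)."""
    if count_of_counts[a] == 0:
        raise ValueError(f"c_{a} = 0: the discount is undefined for count {a}")
    return (a + 1) * count_of_counts[a + 1] / (a * count_of_counts[a])

for a in sorted(count_of_counts):
    print(a, round(good_turing_discount(a), 3))
# 1 0.571
# 2 0.75
# 3 0.0   <- c_4 = 0, so the discount collapses to zero
# 7 0.0   <- the same problem for the largest count
```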
Katz BackOff Model Helps Us Out...
Katz proposed a solution to this Good-Turing problem: he defines a cut-off value \( k \) above which counts \( a > k \) are not discounted, treating these more frequently observed counts as reliable.
The recommended value is \( k = 5 \), so the model no longer fails for large \( a \).
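Continuing the sketch above (same invented counts and `good_turing_discount`), the cut-off is just a guard around the discount: counts above \( k = 5 \) are left as they are.

```python
K = 5  # Katz's recommended cut-off

def katz_discount(a):
    """Discount ratio with the Katz cut-off: counts above K are treated as reliable."""
    if a > K:
        return 1.0                      # frequent counts are not discounted
    return good_turing_discount(a)      # rare counts keep the Good-Turing discount

print(good_turing_discount(7))  # 0.0 -- Good-Turing alone fails for the large count
print(katz_discount(7))         # 1.0 -- the cut-off keeps the count as it is
```

Katz's full formulation also renormalizes the discounts for \( a \le k \) so that the total mass removed matches the Good-Turing estimate for unseen events; the sketch keeps only the simple cut-off rule described here.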
The model handles unseen events by adjusting the maximum-likelihood estimates of the observed n-grams, reserving probability mass for unseen word combinations.
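Putting the pieces together, the sketch below (still using the invented counts and `katz_discount` from above) shows a simplified, single-step back-off for bigrams: seen bigrams get discounted maximum-likelihood estimates, and the mass freed by the discounting is spread over the continuations never seen after a given history, here in proportion to raw unigram counts rather than a full lower-order Katz model.

```python
# Unigram counts of the continuation words, used as the lower-order model.
unigram_counts = Counter()
for (_, w2), c in ngram_counts.items():
    unigram_counts[w2] += c

def backoff_prob(word, history):
    """Simplified single-step Katz back-off for bigrams (sketch only)."""
    seen = {w: c for (h, w), c in ngram_counts.items() if h == history}
    total = sum(seen.values())
    if word in seen:
        # Seen bigram: discounted maximum-likelihood estimate.
        return katz_discount(seen[word]) * seen[word] / total
    # Probability mass reserved by discounting the seen continuations ...
    reserved = 1.0 - sum(katz_discount(c) * c for c in seen.values()) / total
    # ... is spread over the unseen continuations, in proportion to unigram counts.
    unseen_mass = sum(c for w, c in unigram_counts.items() if w not in seen)
    return reserved * unigram_counts[word] / unseen_mass

print(round(backoff_prob("fish", "the"), 3))  # 0.057: seen, discounted below the ML estimate 0.1
print(round(backoff_prob("bird", "the"), 3))  # 0.093: unseen, receives part of the reserved mass
```

For each history, the probabilities over the vocabulary still sum to one, because exactly the reserved mass is handed to the unseen continuations.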