- Good-Turing estimation assumes that the total probability of unseen n-grams can be estimated from $N_1$, the number of n-grams that occurred exactly once in the corpus: the unseen mass is taken to be $N_1/N$, where $N$ is the total number of n-grams observed.
- Because unseen n-grams now receive non-zero probability mass, the mass assigned to seen n-grams must be reduced so that the total still sums to 1. An n-gram seen $r$ times is therefore assigned the probability
$$p_r = \frac{r^*}{N}$$
where
$$r^* = (r+1)\,\frac{N_{r+1}}{N_r}$$
- Here $N_r$ is the number of distinct n-grams that were seen exactly $r$ times in the corpus (the "frequency of frequency" $r$), and $r^*$ is the adjusted count that replaces $r$.
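
The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a production smoother: the function name `good_turing` and the toy counts are made up for the example, and the fallback to the raw count when $N_{r+1} = 0$ is an assumption added here so the largest counts are not set to zero (with that fallback the probabilities no longer sum exactly to 1; practical implementations usually smooth the $N_r$ values instead).

```python
from collections import Counter


def good_turing(ngram_counts):
    """Return (unseen mass, {r: r*}) for a non-empty dict of n-gram counts."""
    n_total = sum(ngram_counts.values())           # N: total n-grams observed
    freq_of_freq = Counter(ngram_counts.values())  # N_r: how many n-grams have count r

    # Probability mass reserved for all unseen n-grams together: N_1 / N
    p_unseen = freq_of_freq[1] / n_total

    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_next = freq_of_freq.get(r + 1, 0)        # N_{r+1}
        if n_next:
            adjusted[r] = (r + 1) * n_next / n_r   # r* = (r + 1) * N_{r+1} / N_r
        else:
            # Assumption for this sketch: keep the raw count when N_{r+1} = 0
            adjusted[r] = float(r)
    return p_unseen, adjusted


# Hypothetical toy bigram counts: N = 10, N_1 = 3, N_2 = 2, N_3 = 1
counts = {"a b": 3, "b c": 2, "c d": 2, "d e": 1, "e f": 1, "f g": 1}
p_unseen, adjusted = good_turing(counts)

# p_r = r* / N for each observed count r
n_total = sum(counts.values())
probs = {r: r_star / n_total for r, r_star in adjusted.items()}
```

With these toy counts the unseen mass is $3/10$, the adjusted count for $r = 1$ is $2 \cdot 2 / 3 \approx 1.33$, and for $r = 2$ it is $3 \cdot 1 / 2 = 1.5$; the count $r = 3$ hits the fallback because no n-gram was seen four times.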