In this app, you can type in a short phrase and click the prediction button. The app then predicts the 5 words that are most likely to come next.
Model: Count-based n-gram language model (unigram, bigram, trigram) with interpolated backoff.
Goal: Predict the next token given the last one or two tokens of history.
Conditional probabilities (from counts):
Unigram: P(w) = count(w) / sum_v count(v)
Bigram: P(w2|w1) = count(w1,w2) / sum_v count(w1,v)
Trigram: P(w3|w1,w2) = count(w1,w2,w3) / sum_v count(w1,w2,v)
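The three count-based estimates above can be sketched in Python as follows. This is a minimal illustration, not the app's actual code: the function names (train_counts, p_uni, p_bi, p_tri) and the toy sentence are assumptions made for the example.

```python
from collections import Counter

def train_counts(tokens):
    """Count unigrams, bigrams, and trigrams in a list of tokens."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def p_uni(w, uni):
    # P(w) = count(w) / sum_v count(v)
    total = sum(uni.values())
    return uni[w] / total if total else 0.0

def p_bi(w2, w1, bi):
    # P(w2|w1) = count(w1,w2) / sum_v count(w1,v)
    denom = sum(c for (a, _), c in bi.items() if a == w1)
    return bi[(w1, w2)] / denom if denom else 0.0

def p_tri(w3, w1, w2, tri):
    # P(w3|w1,w2) = count(w1,w2,w3) / sum_v count(w1,w2,v)
    denom = sum(c for (a, b, _), c in tri.items() if (a, b) == (w1, w2))
    return tri[(w1, w2, w3)] / denom if denom else 0.0

# Toy corpus (hypothetical), already tokenized by whitespace.
tokens = "the cat sat on the mat the cat ran".split()
uni, bi, tri = train_counts(tokens)

print(p_uni("the", uni))              # 3 of 9 tokens are "the"
print(p_bi("cat", "the", bi))         # "the" is followed by "cat" 2 times out of 3
print(p_tri("sat", "the", "cat", tri))  # "the cat" is followed by "sat" 1 time out of 2
```

Scanning every n-gram to compute each denominator keeps the sketch close to the formulas; a real implementation would likely store the context totals once.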
Interpolation: P*(w3|w1,w2) = a3*P_tri(w3|w1,w2) + a2*P_bi(w3|w2) + a1*P_uni(w3), where a1 + a2 + a3 = 1 and each weight is non-negative.
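As a small worked example of the interpolation, the sketch below mixes three component probabilities with fixed weights. The weight values (0.6, 0.3, 0.1) and the input probabilities are illustrative assumptions, not the values used by the app.

```python
def interpolate(p_tri, p_bi, p_uni, a3=0.6, a2=0.3, a1=0.1):
    """Linearly interpolate trigram, bigram, and unigram probabilities.

    The default weights are illustrative only; they must sum to 1.
    """
    if abs(a1 + a2 + a3 - 1.0) > 1e-9:
        raise ValueError("weights a1 + a2 + a3 must sum to 1")
    return a3 * p_tri + a2 * p_bi + a1 * p_uni

# An unseen trigram (P_tri = 0) still receives probability mass
# from the bigram and unigram estimates.
p = interpolate(p_tri=0.0, p_bi=0.25, p_uni=0.05)
print(p)
```

In practice the weights are usually tuned on held-out text rather than fixed by hand.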
Smoothing options: add-k or Kneser–Ney, so that unseen n-grams do not receive zero probability.
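Of the two smoothing options, add-k is the simpler one to illustrate. The sketch below applies it to the bigram estimate only; Kneser–Ney is not shown. The function name p_bi_add_k, the value k = 0.5, and the toy corpus are assumptions made for the example.

```python
from collections import Counter

def p_bi_add_k(w2, w1, bi, vocab, k=0.5):
    """Add-k smoothed bigram probability.

    P(w2|w1) = (count(w1,w2) + k) / (sum_v count(w1,v) + k * |V|)
    """
    context_total = sum(bi[(w1, v)] for v in vocab)
    return (bi[(w1, w2)] + k) / (context_total + k * len(vocab))

# Toy corpus (hypothetical), tokenized by whitespace.
tokens = "the cat sat on the mat".split()
bi = Counter(zip(tokens, tokens[1:]))
vocab = sorted(set(tokens))

seen = p_bi_add_k("cat", "the", bi, vocab)    # bigram observed once
unseen = p_bi_add_k("sat", "the", bi, vocab)  # bigram never observed
print(seen, unseen)
```

The unseen bigram gets a small non-zero probability, and the smoothed probabilities for a given context still sum to 1 over the vocabulary.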
Prediction: return the top 5 tokens ranked by the interpolated probability P*.
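The ranking step can be sketched as below, assuming the interpolated probabilities for the current history are already available as a dictionary. The function name top_k and the scores are hypothetical, chosen only to show the selection.

```python
import heapq

def top_k(scores, k=5):
    """Return the k (token, probability) pairs with the highest probability."""
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# Hypothetical interpolated probabilities for one history.
scores = {
    "mat": 0.30, "sofa": 0.22, "floor": 0.18, "bed": 0.12,
    "roof": 0.08, "moon": 0.06, "sea": 0.04,
}
predictions = [token for token, _ in top_k(scores)]
print(predictions)
```

Using heapq.nlargest avoids sorting the whole vocabulary when only 5 candidates are needed.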
The overall accuracy of the unigram, bigram, and trigram models is acceptable.