Data Science Capstone

Wei Chen

Predictive Text Model

In this app, you can type in a short phrase and click the prediction button. The app then predicts the top 5 words that are most likely to come next.

Algorithm Overview

Model: Count-based n-gram language model (unigram, bigram, trigram) with interpolated backoff.
Goal: Predict the next token given the last one or two tokens of history.
Conditional probabilities (from counts):
- Unigram: P(w) = count(w) / sum_v count(v)
- Bigram: P(w2|w1) = count(w1,w2) / sum_v count(w1,v)
- Trigram: P(w3|w1,w2) = count(w1,w2,w3) / sum_v count(w1,w2,v)
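
The three count-based estimates above can be illustrated with a minimal Python sketch. The helper names (ngram_counts, mle_prob) and the toy corpus are hypothetical, chosen only for illustration; they are not taken from the app itself.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(counts, history, word):
    """P(word | history) = count(history, word) / sum_v count(history, v)."""
    h = tuple(history)
    # Denominator: total count of all n-grams that share this history.
    total = sum(c for gram, c in counts.items() if gram[:-1] == h)
    return counts[h + (word,)] / total if total else 0.0

# Toy corpus, assumed for illustration only.
tokens = "i like tea . i like coffee . i drink tea .".split()
uni = ngram_counts(tokens, 1)
bi = ngram_counts(tokens, 2)
tri = ngram_counts(tokens, 3)

p_uni = mle_prob(uni, [], "tea")            # count(tea) / total tokens
p_bi = mle_prob(bi, ["like"], "tea")        # count(like, tea) / count(like, *)
p_tri = mle_prob(tri, ["i", "like"], "tea") # count(i, like, tea) / count(i, like, *)
```

An unseen history yields 0.0 here rather than an error, which is one reason the model combines several n-gram orders.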

Algorithm Overview (cont.)

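Interpolation: P*(w3|w1,w2) = a3 P_tri(w3|w1,w2) + a2 P_bi(w3|w2) + a1 P_uni(w3), where a1 + a2 + a3 = 1 and each weight is non-negative.
Smoothing options: add-k or Kneser–Ney, to avoid zero probabilities for unseen n-grams.
Prediction: return the 5 tokens with the highest interpolated probability P*.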
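
As a rough sketch of interpolation and top-5 prediction, the Python below mixes trigram, bigram, and unigram estimates with fixed weights and ranks every vocabulary word by the result. The weights (0.1, 0.3, 0.6), the function names, and the toy corpus are assumptions for illustration; the app's actual weights and smoothing are not stated here, and no smoothing is applied in this sketch.

```python
from collections import Counter, defaultdict

def train(tokens):
    """Collect unigram counts and per-history bigram/trigram counts."""
    uni = Counter(tokens)
    bi = defaultdict(Counter)
    tri = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        bi[w1][w2] += 1
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        tri[(w1, w2)][w3] += 1
    return uni, bi, tri

def distribution(counter):
    """Turn raw counts into relative frequencies (empty if nothing was seen)."""
    total = sum(counter.values())
    return {w: c / total for w, c in counter.items()} if total else {}

def predict(history, model, weights=(0.1, 0.3, 0.6), k=5):
    """Return the k words with the highest interpolated probability."""
    uni, bi, tri = model
    a1, a2, a3 = weights  # assumed values; they must sum to 1
    # Pad short histories with a placeholder that never occurs in the corpus.
    w1, w2 = (["<none>", "<none>"] + list(history))[-2:]
    p_uni = distribution(uni)
    p_bi = distribution(bi.get(w2, Counter()))
    p_tri = distribution(tri.get((w1, w2), Counter()))
    scores = {
        w: a3 * p_tri.get(w, 0.0) + a2 * p_bi.get(w, 0.0) + a1 * p_uni[w]
        for w in uni
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

tokens = "i like tea . i like coffee . i drink tea .".split()
model = train(tokens)
top5 = predict(["i", "like"], model)
```

Because the unigram term is always non-zero for known words, the sketch still returns five candidates when the trigram or bigram history was never observed.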

App Accuracy

The overall accuracy for unigram, bigram, and trigram is acceptable.
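
How accuracy was measured is not stated above. One common choice for a next-word predictor is top-k accuracy: the fraction of held-out positions where the true next word appears among the k suggestions. The Python sketch below assumes that metric; top_k_accuracy and the stand-in predictor are hypothetical names, not part of the app.

```python
def top_k_accuracy(predict_fn, tokens, k=5, order=3):
    """Fraction of positions whose true next token is in the top-k predictions.

    predict_fn takes a list of history tokens and returns ranked candidates.
    """
    hits = total = 0
    for i in range(order - 1, len(tokens)):
        history = tokens[i - (order - 1):i]  # the last (order - 1) tokens
        if tokens[i] in predict_fn(history)[:k]:
            hits += 1
        total += 1
    return hits / total if total else 0.0

# Stand-in predictor for demonstration: ignores the history entirely.
def constant_predictor(history):
    return ["the", "a", "of"]

held_out = "the cat sat on the mat".split()
accuracy = top_k_accuracy(constant_predictor, held_out)
```

Passing the predictor in as a function keeps the metric independent of the model, so the same loop could compare unigram, bigram, and trigram predictors on one held-out set.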