Data Science Capstone

Wei Chen

Predictive Text Model

In this app, you can type in a short phrase and click the prediction button. The app predicts the top 5 words that are most likely to come next.

Algorithm Overview

  • Model: Count-based n-gram language model (unigram, bigram, trigram) with interpolated backoff.

  • Goal: Predict next token given the last one or two tokens of history.

  • Conditional probabilities (from counts):

    • Unigram: P(w) = count(w) / sum_v count(v)

    • Bigram: P(w2|w1) = count(w1,w2) / sum_v count(w1,v)

    • Trigram: P(w3|w1,w2) = count(w1,w2,w3) / sum_v count(w1,w2,v)
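
The count-based estimates above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the app's actual code: the function names (build_counts, p_unigram, p_bigram, p_trigram) and the toy sentence are hypothetical, and the input is assumed to be an already tokenized list of words.

```python
from collections import Counter

def build_counts(tokens):
    """Count unigrams, bigrams and trigrams in a list of tokens."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def p_unigram(w, uni):
    # P(w) = count(w) / sum_v count(v)
    total = sum(uni.values())
    return uni[w] / total if total else 0.0

def p_bigram(w2, w1, bi):
    # P(w2|w1) = count(w1,w2) / sum_v count(w1,v)
    denom = sum(c for (a, _), c in bi.items() if a == w1)
    return bi[(w1, w2)] / denom if denom else 0.0

def p_trigram(w3, w1, w2, tri):
    # P(w3|w1,w2) = count(w1,w2,w3) / sum_v count(w1,w2,v)
    denom = sum(c for (a, b, _), c in tri.items() if (a, b) == (w1, w2))
    return tri[(w1, w2, w3)] / denom if denom else 0.0

# Toy corpus (hypothetical); a real model would be trained on a large text sample.
tokens = "the cat sat on the mat the cat ran".split()
uni, bi, tri = build_counts(tokens)
print(p_unigram("the", uni), p_bigram("cat", "the", bi), p_trigram("sat", "the", "cat", tri))
```

Each function returns 0.0 when the history was never observed, which is exactly the zero-count case that interpolation and smoothing are meant to handle.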

Algorithm Overview (cont.)

  • Interpolation: P*(w3|w1,w2) = a3·P_tri(w3|w1,w2) + a2·P_bi(w3|w2) + a1·P_uni(w3), where a1 + a2 + a3 = 1 and each weight is non-negative.

  • Smoothing options: add-k or Kneser–Ney, to avoid assigning zero probability to unseen n-grams.

  • Prediction: return the top 5 tokens ranked by the interpolated probability P*.
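
The interpolation and top-5 ranking steps could look roughly like the Python sketch below. It is a minimal illustration, not the app's implementation: the names train and predict, the toy corpus, and the weights (a1, a2, a3) = (0.1, 0.3, 0.6) are all assumptions chosen for the example, and no add-k or Kneser–Ney smoothing is applied.

```python
from collections import Counter

def train(tokens):
    """Collect n-gram counts plus the history totals used as denominators."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi_hist = Counter()   # sum_v count(w1, v)
    for (w1, _), c in bi.items():
        bi_hist[w1] += c
    tri_hist = Counter()  # sum_v count(w1, w2, v)
    for (w1, w2, _), c in tri.items():
        tri_hist[(w1, w2)] += c
    return uni, bi, tri, bi_hist, tri_hist

def predict(history, model, weights=(0.1, 0.3, 0.6), k=5):
    """Return the k tokens with the highest interpolated probability.

    weights = (a1, a2, a3) are illustrative values that sum to 1.
    """
    uni, bi, tri, bi_hist, tri_hist = model
    a1, a2, a3 = weights
    # Pad so that histories with fewer than two tokens still work.
    w1, w2 = ([None, None] + list(history))[-2:]
    total = sum(uni.values())
    scores = {}
    for w in uni:
        p_uni = uni[w] / total
        p_bi = bi[(w2, w)] / bi_hist[w2] if bi_hist[w2] else 0.0
        p_tri = tri[(w1, w2, w)] / tri_hist[(w1, w2)] if tri_hist[(w1, w2)] else 0.0
        scores[w] = a3 * p_tri + a2 * p_bi + a1 * p_uni
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy corpus (hypothetical); a real model would be trained on a large text sample.
model = train("the cat sat on the mat the cat ran".split())
print(predict(["the"], model))
```

When the trigram or bigram history is unseen, its term contributes zero and the ranking falls back on the lower-order estimates, so the function always returns candidates as long as the vocabulary is non-empty.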

App Accuracy

The overall accuracy for unigram, bigram, and trigram is acceptable.