Ngram Word Prediction

Yul Young Park
09/27/2018

Shiny App: https://yypark.shinyapps.io/Shiny_PredictText/

The goal of this project is to build a predictive text model: a language model that helps people type on mobile devices.

A language model (LM) computes the conditional probability of an upcoming word (\( W_{N} \)) given the sequence of previous words (\( W_{1}, W_{2}, ..., W_{N-1} \)): \[ P(W_{N} | W_{1}, W_{2}, ..., W_{N-1}) \]
The Markov assumption simplifies the calculation by conditioning only on the most recent words:
\[ P(W_{N} | W_{1}, W_{2}, ..., W_{N-1})\approx P(W_{N}|W_{N-k},\dots ,W_{N-1}) \;\;\; \]
N-gram models based on the Markov assumption condition on the previous k = n - 1 words: unigram (k=0), bigram (k=1), trigram (k=2), …
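As a worked example of the approximation above, the trigram probability is commonly estimated by maximum likelihood from counts in the training set. The count notation \( C(\cdot) \) is introduced here for illustration, and the source does not state which estimator the project used:
\[ \hat{P}(W_{N} | W_{N-2}, W_{N-1}) = \frac{C(W_{N-2}, W_{N-1}, W_{N})}{C(W_{N-2}, W_{N-1})} \]
Whenever the three-word sequence never occurs in the training set, the numerator is zero and so is the estimate.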
N-grams that don't appear in the training set get zero probability and need to be handled by a smoothing technique such as backoff.
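The backoff idea can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the project's implementation: the function names `train` and `predict`, the whitespace tokenizer, and the toy sentences are all hypothetical, and the rule used here is the simplest form of backoff (try the trigram context first, then the bigram context, then fall back to the most frequent word overall).

```python
from collections import Counter, defaultdict


def train(sentences):
    """Count which words follow each 2-word, 1-word, and empty context."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: naive whitespace tokens
        for i, word in enumerate(tokens):
            for k in (0, 1, 2):  # context length for 1-, 2-, and 3-grams
                if i >= k:
                    followers[tuple(tokens[i - k:i])][word] += 1
    return followers


def predict(followers, phrase):
    """Return the most frequent next word, backing off 3-gram -> 2-gram -> 1-gram."""
    tokens = phrase.lower().split()
    for k in (2, 1, 0):
        context = tuple(tokens[len(tokens) - k:]) if k else ()
        # Skip this order if the phrase is too short or the context was never seen.
        if len(context) == k and context in followers:
            return followers[context].most_common(1)[0][0]
    return None  # only reached when the model was trained on no words at all


# Toy data, invented for this example only.
toy_sentences = [
    "i love new york",
    "i love new shoes",
    "i love new york city",
    "you love old books",
]
model = train(toy_sentences)
print(predict(model, "i love new"))        # trigram context ("love", "new") is known
print(predict(model, "really old"))        # backs off to the bigram context ("old",)
print(predict(model, "something unseen"))  # backs off to the most frequent word
```

Storing the followers of every context in one dictionary keeps each lookup to at most three hash accesses, which matters when prediction speed is the priority.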

Data: from Capstone Dataset
- blogs, news, and twitter files (refer to EDA for details)
- total 3,336,695 lines, 6,291,066 sentences, and 38,380,791 tokens
Language model used: 3-, 2-, and 1-gram model with backoff
Model performance, measured with an 80% train / 20% test split:
- accuracy: 27.7% (traded off for speed)
- efficiency (speed): 0.083 sec (average of 10 test phrases)
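The two measurements above could be computed along the following lines. This is a sketch under stated assumptions: it takes accuracy to mean the share of held-out next words that match the top prediction and speed to mean the average wall-clock time per prediction, neither of which the source defines precisely. The helper names `split_train_test` and `evaluate`, and the stub predictor in the usage lines, are hypothetical.

```python
import random
import time


def split_train_test(lines, train_share=0.8, seed=0):
    """Shuffle the lines reproducibly and split them into train and test parts."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def evaluate(predict_fn, test_sentences):
    """Return (top-1 accuracy, average seconds per prediction)."""
    hits = 0
    total = 0
    elapsed = 0.0
    for sentence in test_sentences:
        tokens = sentence.lower().split()
        # Predict every word after the first from the words that precede it.
        for i in range(1, len(tokens)):
            start = time.perf_counter()
            guess = predict_fn(" ".join(tokens[:i]))
            elapsed += time.perf_counter() - start
            if guess == tokens[i]:
                hits += 1
            total += 1
    if total == 0:
        return 0.0, 0.0
    return hits / total, elapsed / total


# Usage with a stub predictor that always answers "b" (illustration only).
train_part, test_part = split_train_test([str(n) for n in range(10)])
accuracy, seconds = evaluate(lambda phrase: "b", ["a b", "a c"])
print(len(train_part), len(test_part), accuracy)
```

Any function that maps a phrase to a single predicted word can be passed as `predict_fn`, so the same harness works for comparing model variants.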

Built shiny app

References: