Ngram Word Prediction

Yul Young Park
09/27/2018

Overview

The goal of this project is to build a predictive text model: a language model that helps people type on mobile devices by suggesting the next word.

  • A language model (LM) computes the conditional probability of an upcoming word (\( W_{N} \)) given the sequence of previous words (\( W_{1}, W_{2}, \dots, W_{N-1} \)): \[ P(W_{N} | W_{1}, W_{2}, \dots, W_{N-1}) \]
  • The Markov assumption simplifies the calculation by conditioning only on the last k words:
    \[ P(W_{N} | W_{1}, W_{2}, \dots, W_{N-1}) \approx P(W_{N} | W_{N-k}, \dots, W_{N-1}) \]
  • N-gram models based on the Markov assumption condition on the previous n-1 words: unigram (k=0), bigram (k=1), trigram (k=2), …
  • Words that do not appear in the training set get zero probability, which has to be handled by a smoothing technique such as backoff.
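
As a minimal sketch of the points above, the snippet below estimates bigram probabilities by maximum likelihood (count of the pair divided by count of the preceding word) on a toy corpus. The corpus and the helper names `ngram_counts` and `bigram_prob` are illustrative assumptions, not the project's code. The last line shows how an unseen word pair ends up with probability zero, which is the case smoothing or backoff has to cover.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (stored as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bigram_prob(word, prev, unigrams, bigrams):
    """Maximum-likelihood estimate of P(word | prev) = count(prev, word) / count(prev)."""
    prev_count = unigrams[(prev,)]
    if prev_count == 0:
        return 0.0  # the context itself was never seen
    return bigrams[(prev, word)] / prev_count

# Toy corpus (hypothetical, for illustration only)
tokens = "i like green tea and i like black coffee".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

p_seen = bigram_prob("like", "i", unigrams, bigrams)          # "i" is always followed by "like"
p_half = bigram_prob("green", "like", unigrams, bigrams)      # "like" is followed by "green" once out of twice
p_unseen = bigram_prob("coffee", "green", unigrams, bigrams)  # pair never observed, so the estimate is zero
```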

Predictive Text Model Implementation

  • Data: from Capstone Dataset

    • blogs, news, and twitter files (refer to EDA for details)
    • total 3,336,695 lines, 6,291,066 sentences, and 38,380,791 tokens
  • Language model used: trigram, bigram, and unigram models combined by backoff

  • Model performance, measured with an 80% train / 20% test split:

    • accuracy: 27.7% (some accuracy was sacrificed for the sake of speed)
    • efficiency (speed): 0.083 sec (average of 10 test phrases)
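
The backoff lookup and the accuracy measurement listed above can be sketched as follows: try the trigram context first, fall back to the bigram context, and finally to the most frequent unigrams. This is a simplified backoff without discounting, written only for illustration. The class `BackoffPredictor`, the function `accuracy`, and the toy sentences are assumptions and not the project's actual implementation.

```python
from collections import Counter, defaultdict

class BackoffPredictor:
    """Hypothetical next-word predictor: trigram, then bigram, then unigram."""

    def __init__(self):
        self.unigrams = Counter()
        # followers[k][context] counts the words seen after a k-word context
        self.followers = {1: defaultdict(Counter), 2: defaultdict(Counter)}

    def train(self, sentences):
        for sentence in sentences:
            tokens = sentence.lower().split()
            self.unigrams.update(tokens)
            for k in (1, 2):
                for i in range(k, len(tokens)):
                    context = tuple(tokens[i - k:i])
                    self.followers[k][context][tokens[i]] += 1

    def predict(self, phrase, top=1):
        """Return up to `top` candidate next words, most frequent first."""
        tokens = phrase.lower().split()
        for k in (2, 1):  # longest context first, then back off
            if len(tokens) < k:
                continue
            candidates = self.followers[k].get(tuple(tokens[-k:]))
            if candidates:
                return [word for word, _ in candidates.most_common(top)]
        # No known context: fall back to the most frequent words overall
        return [word for word, _ in self.unigrams.most_common(top)]

def accuracy(model, test_sentences):
    """Share of positions where the top prediction equals the actual next word."""
    hits = total = 0
    for sentence in test_sentences:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens)):
            guess = model.predict(" ".join(tokens[:i]))
            hits += guess[:1] == [tokens[i]]
            total += 1
    return hits / total if total else 0.0

# Toy training data (hypothetical)
model = BackoffPredictor()
model.train([
    "i like green tea",
    "i like green apples",
    "you like black coffee",
    "i like green tea",
])

from_trigram = model.predict("i like")            # context ("i", "like") was seen
from_bigram = model.predict("they like")          # backs off to context ("like",)
alternatives = model.predict("like green", top=2) # several ranked candidates
```

Asking for more than one candidate with `top` gives a ranked list of alternatives, which is the kind of information a word cloud of less likely words could be drawn from.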

Predictive Text Product: Shiny App

The model is deployed as a Shiny app:

  • Step 1: type your own phrase into the text box in the side panel
  • Step 2: click the 'submit' button below it to see the predicted word

Conclusion

  • The app predicts the next word with a reasonable delay
  • A word cloud plot complements the result by showing other, less likely words
  • Usage-based user input data may improve model performance

References:

https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
http://datasciencespecialization.github.io/capstone/