Yet Another Next Word Guesser

Pascal P.
01-12-2018

Yet Another Next Word Guesser [やねをぐ]

Context: Coursera JHU, Capstone Project

Purpose 2/5

  • Build a language model based on a corpus provided by SwiftKey
  • The corpus is composed of three sources:
    • blogs
    • news and
    • twitter
  • Use the resulting language model (5-grams) to predict the next word given some previous words (here, the 4 previous words).
  • Build an interactive web application using R Shiny which, given a sentence, provides a set of predictions (possible next words).
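  • For illustration, a minimal sketch of the Shiny wiring this implies is shown below; the input/output names and the predict_next_words stub are assumptions, not the actual application code.

library(shiny)

# Placeholder for the model look-up; the real app queries the pre-computed
# n-gram tables (see the Prediction Algorithm slide). Names and the dummy
# output below are illustrative only.
predict_next_words <- function(context) {
  data.frame(word = c("the", "to", "and", "a", "of"), score = c(5, 4, 3, 2, 1) / 5)
}

ui <- fluidPage(
  textInput("phrase", "Type a sentence:"),
  tableOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderTable({
    req(input$phrase)
    # Keep only the last four words typed by the user as the prediction context
    context <- tail(strsplit(trimws(input$phrase), "\\s+")[[1]], 4)
    predict_next_words(context)
  })
}

shinyApp(ui, server)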

Model 3/5

  • Built a 5-gram probabilistic language model with a simple (“stupid”) back-off algorithm, using the R quanteda package.
  • Used the full corpus after clean-up and created three sets: training (80%), test (10%) and development (10%).
  • Background
    • A probabilistic model is based on the idea that in order to predict the next word, we only need a few previous words (Markov assumption).
    • In a 5-gram model, we rely on the four previous words to predict the fifth one.
    • Formally, we have a sequence of \( n \) words \( w_{1}^{n} = w_{1} \dots w_{n} \), and when we use a 5-gram model we are making the following approximation: \[ \begin{aligned} P(w_{n}|w_{1}^{n-1}) \approx P(w_{n}|w_{n-4}^{n-1}) \end{aligned} \] where \( P(w|h) \) is the probability of a word \( w \) given some history (previous words) \( h \). These probabilities are estimated using maximum likelihood estimation.
    • “Stupid” back-off uses relative frequencies (counts) to define a score (denoted \( S \); it is not a probability): \[ \begin{aligned} S(w_{n}|w_{n-k+1}^{n-1})=\left\lbrace \begin{array}{l} \frac{C(w_{n-k+1}^{n})}{C(w_{n-k+1}^{n-1})}, \text{ if } C(w_{n-k+1}^{n}) > 0 \\ \alpha \times S(w_{n}|w_{n-k+2}^{n-1}), \text{ otherwise} \end{array} \right. \end{aligned} \] with a recommended value \( \alpha=0.4 \). Note that the back-off case recurses on the score \( S \) with a shorter history.
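    • For illustration, a minimal R sketch of this score follows; it assumes the n-gram counts are available as named count vectors (one per order), and the names counts and sbo_score are illustrative only.

# Recursive "stupid" back-off score S(w | history).
# `counts` is assumed to be a list of named integer vectors, one per n-gram
# order, keyed by space-separated n-grams, e.g. counts[[2]]["of the"].
sbo_score <- function(word, history, counts, alpha = 0.4) {
  k <- length(history) + 1                      # order of the n-gram being scored
  if (k == 1) {
    c_w <- counts[[1]][word]                    # base case: uni-gram relative frequency
    if (is.na(c_w)) c_w <- 0
    return(c_w / sum(counts[[1]]))
  }
  ngram  <- paste(c(history, word), collapse = " ")
  prefix <- paste(history, collapse = " ")
  c_full <- counts[[k]][ngram]
  c_pref <- counts[[k - 1]][prefix]
  if (!is.na(c_full) && c_full > 0 && !is.na(c_pref) && c_pref > 0) {
    c_full / c_pref                             # seen n-gram: relative frequency
  } else {
    # Unseen n-gram: back off to a shorter history, penalised by alpha
    alpha * sbo_score(word, history[-1], counts, alpha)
  }
}

# Example: score the candidate "day" given the 4-word context "have a very nice"
# sbo_score("day", c("have", "a", "very", "nice"), counts)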

Prediction Algorithm 4/5

  • We pre-computed all the scores, from 5-grams down to uni-grams, using the previous equation (“stupid” back-off) on our language model.
  • All results were stored in data tables (using the R data.table package).
  • We then pruned the results: all entries with a frequency count of less than 3 were removed.
    • This is a trade-off between efficiency and the size of the data model.
  • The prediction step first needs to load the data model into memory (this takes a few seconds).
  • Once loaded, given the previous four words (context) from user input, the application reactively starts a look-up in the 5-gram portion of the model.
  • If we find enough matching entries, we can return the result:
    • the top 5 entries, in decreasing order of score, to choose from.
  • Otherwise, if no match is found or there are not enough matches (strictly fewer than 5), we do a look-up in the 4-gram portion of the model.
    • This process is repeated, down to the uni-gram portion of the model, until we get our 5 predictions (a sketch of this look-up is given after the benchmark figures below).
  • If there is no match at all, we return the top 5 uni-grams (by relative frequency).
  • Using the benchmark provided (cf. the References section), we obtained:
Overall top-3 score:     18.51 %
Overall top-1 precision: 13.93 %
Overall top-3 precision: 22.49 %
Average runtime:         15.11 msec
Number of predictions:   28464
Total memory used:       152.47 MB
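
  • For illustration, a minimal sketch of this back-off look-up follows; it assumes the pre-computed scores are held in one data.table per n-gram order, with columns prefix, word and score (the object model and these column names are assumptions, not the actual schema).

library(data.table)

# Sketch of the back-off look-up: query the highest-order table first, then
# back off to lower orders until 5 candidate words have been collected.
# `model` is assumed to be a list of data.tables, model[[1]] .. model[[5]],
# each with columns prefix, word and score (model[[1]] has an empty prefix).
predict_next <- function(context, model, n_pred = 5) {
  context <- tail(context, 4)                   # at most the 4 previous words
  if (length(context) == 0) {                   # no context: just return top uni-grams
    return(head(model[[1]][order(-score), .(word, score)], n_pred))
  }
  preds <- data.table(word = character(), score = numeric())
  for (k in seq(length(context) + 1, 2)) {      # e.g. 5-grams down to bi-grams
    prefix_k <- paste(tail(context, k - 1), collapse = " ")
    hits <- model[[k]][prefix == prefix_k][order(-score)]
    hits <- hits[!word %in% preds$word]         # keep only new candidate words
    preds <- rbind(preds, hits[, .(word, score)])
    if (nrow(preds) >= n_pred) break
  }
  if (nrow(preds) < n_pred) {
    # Not enough matches anywhere: fall back to the most frequent uni-grams
    uni <- model[[1]][order(-score)][!word %in% preds$word]
    preds <- rbind(preds, uni[, .(word, score)])
  }
  head(preds, n_pred)                           # top 5 in decreasing score order
}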

Screenshot Demo App 5/5


  • Instructions on how to use the demonstration application can be found in the Shiny application itself, under the “About” tab.
    Please check them if needed.
  • Example - prediction and score:
    [Screenshot: prediction example]


  • Example - loading the data model:
    [Screenshot: modal dialog for loading the data model]

References