Next Word Prediction App

Luc Frachon
April 18th, 2016

A quick and flexible next-word predictor for fast messag|

Objectives

  • Usefulness: Provide meaningful suggestions for next word candidates
  • Responsiveness: Fast response times for real-life usability
  • Flexibility: Let the user trade accuracy against clutter by choosing the number of suggestions returned.

The Model (intuition)

  • Standard “Kneser-Ney” model on tri-grams. The model guesses the last word of each tri-gram.
    \( \Rightarrow \) e.g. “president barack” \( \rightarrow \) “obama”
  • User inputs a string. The app isolates the last two words: \( w_1 w_2 \)
  • Search the corpus for all \( w_3 \)'s that might complete the tri-gram.
  • For each \( w_3 \), a probability \( P_{KN}(w_3|w_1w_2) \) is computed and the \( K \) most probable candidates are returned (\( K \) is user-defined).
    • Intuitively, the probability of a tri-gram \( w_1 w_2 w_3 \) is high if:
      • \( w_1 w_2 w_3 \) appears frequently in the training corpus,
      • Or if its lower-order n-grams (\( w_2 w_3 \) and \( w_3 \)) have high probabilities themselves.
        • For lower orders, probabilities depend on the number of different single preceding words (“single word contexts”), e.g. the size of the set \( \{w_j \: | \: c(w_j w_2 w_3) > 0\} \) for bi-grams.
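The intuition above can be made concrete with a minimal sketch of interpolated Kneser-Ney on tri-grams. This is an illustration, not the app's code: the app is built in R, and the single fixed discount `D = 0.75`, the function names `build_counts` and `p_kn`, and the back-off to a lower order for unseen contexts are all assumptions made here.

```python
from collections import Counter

D = 0.75  # assumed absolute discount; the slides do not state the value used


def build_counts(tokens):
    """Collect the counts needed for tri-gram Kneser-Ney from a token list."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi_types = set(zip(tokens, tokens[1:]))
    c = {
        "tri": tri,           # c(w1 w2 w3)
        "ctx12": Counter(),   # c(w1 w2 .): occurrences of the context
        "foll12": Counter(),  # distinct w3 seen after (w1, w2)
        "cont23": Counter(),  # distinct single words preceding (w2, w3)
        "cont2": Counter(),   # distinct tri-gram types with w2 in the middle
        "foll2": Counter(),   # distinct w3 seen after w2 inside a tri-gram
        "cont3": Counter(),   # distinct single words preceding w3
        "n_bi": len(bi_types),
    }
    for (w1, w2, w3), n in tri.items():
        c["ctx12"][(w1, w2)] += n
        c["foll12"][(w1, w2)] += 1
        if c["cont23"][(w2, w3)] == 0:
            c["foll2"][w2] += 1  # first time this w3 is seen after w2
        c["cont23"][(w2, w3)] += 1
        c["cont2"][w2] += 1
    for _, w in bi_types:
        c["cont3"][w] += 1
    return c


def p_kn(c, w1, w2, w3):
    """P_KN(w3 | w1 w2), falling back to lower orders for unseen contexts."""
    # Order 1: share of bi-gram types that end in w3 (continuation probability).
    p1 = c["cont3"][w3] / c["n_bi"]
    if c["cont2"][w2] == 0:
        return p1
    # Order 2: discounted continuation count, interpolated with order 1.
    lam2 = D * c["foll2"][w2] / c["cont2"][w2]
    p2 = max(c["cont23"][(w2, w3)] - D, 0) / c["cont2"][w2] + lam2 * p1
    n12 = c["ctx12"][(w1, w2)]
    if n12 == 0:
        return p2
    # Order 3: discounted raw count, interpolated with order 2.
    lam3 = D * c["foll12"][(w1, w2)] / n12
    return max(c["tri"][(w1, w2, w3)] - D, 0) / n12 + lam3 * p2
```

The highest order uses raw tri-gram counts, while the two lower orders use the "single word context" counts described above, so a word that follows many different words scores well even when the exact tri-gram is rare.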

Main Challenges

  • Data size: The model is trained on 3.3 million texts (articles, blogs and tweets), but the Shiny server limits how much data can be hosted. A space-efficient structure was required.

Classes 'data.table' and 'data.frame':  3298835 obs. of  14 variables:
 $ order    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ w1       : chr  NA NA NA NA ...
 $ w2       : chr  NA NA NA NA ...
 $ w3       : chr  "0-5years" "0-for-the" "0.03mm" "0.28mm" ...
 $ std.count: int  2 2 2 6 5 2 7 3 2 61 ...
 $ cont11   : int  1 0 1 1 1 0 1 0 1 3 ...
 $ cont12   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cont22   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cont32   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cont33   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ bg_count : int  NA NA NA NA NA NA NA NA NA NA ...
 $ Pkn_1    : num  7.75e-07 0.00 7.75e-07 7.75e-07 7.75e-07 ...
 $ Pkn_2    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Pkn_3    : num  NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "sorted")= chr  "order" "w2" "w1" "w3"

  • Response time: A few tenths of a second at most. This was achieved by pre-calculating all probabilities and storing them in a keyed data.table. The application itself only performs lookups.
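The lookup-only design can be sketched as follows. A Python dict stands in for the keyed data.table, and the names `build_lookup`, `last_two_words` and `suggest`, the tokenising regular expression, and the cap of 10 stored candidates are assumptions for illustration. The sketch also leaves out any fall-back to shorter contexts when the last two words were never seen.

```python
import re


def build_lookup(scored, max_k=10):
    """Offline step: group pre-computed (w1, w2, w3, prob) rows by context.

    Candidates are sorted once, most probable first, so that serving a
    request needs no arithmetic at all.
    """
    grouped = {}
    for w1, w2, w3, prob in scored:
        grouped.setdefault((w1, w2), []).append((prob, w3))
    return {
        ctx: [w for _, w in sorted(cands, key=lambda t: -t[0])[:max_k]]
        for ctx, cands in grouped.items()
    }


def last_two_words(text):
    """Isolate the last two words of the user's input, lower-cased."""
    words = re.findall(r"[a-z0-9'-]+", text.lower())
    return tuple(words[-2:])


def suggest(table, text, k=3):
    """Online step: one keyed lookup, then keep the top k candidates."""
    return table.get(last_two_words(text), [])[:k]


# Toy rows with made-up probabilities, purely to show the call pattern.
toy_rows = [
    ("president", "barack", "obama", 0.9),
    ("president", "barack", "himself", 0.05),
]
toy_table = build_lookup(toy_rows)
print(suggest(toy_table, "I met President Barack", k=1))
```

Because sorting happens when the table is built, the cost of each request is a single hash lookup plus a slice, which is what keeps response times short regardless of corpus size.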

Resulting Product

URL Link: https://lucfrachon-ds.shinyapps.io/NLP_NextWordPrediction/

  • Type text in the input box
  • The output box will suggest possible next words
  • Select a word from the box and click on the “Use selected word” button to add it to the text, or…
  • Continue typing and the app will suggest more words to complete your phrase.
  • Select the maximum number of suggested words using the slider.
  • Have fun!
