Next Word Prediction App

Luc Frachon
April 18th, 2016

A quick and flexible next-word predictor for fast messaging

Objectives

Usefulness: Provide meaningful suggestions for next word candidates
Responsiveness: Fast response times for real-life usability
Flexibility: Let the user set the balance between accuracy and clutter by choosing the number of suggestions returned.


The Model (intuition)

Standard “Kneser-Ney” model on tri-grams. The model predicts the last word of each tri-gram from the first two.
e.g. “president barack” \( \Rightarrow \) “obama”
User inputs a string. The app isolates the last two words: \( w_1 w_2 \)
Search the corpus for all \( w_3 \)'s that might complete the tri-gram.
For each \( w_3 \), a probability \( P_{KN}(w_3|w_1w_2) \) is computed, and the \( K \) most probable candidates are returned (\( K \) is user-defined).
- Intuitively, the probability of a tri-gram \( w_1 w_2 w_3 \) is high if:
  - \( w_1 w_2 w_3 \) appears frequently in the training corpus,
  - Or if its lower-order n-grams (\( w_2 w_3 \) and \( w_3 \)) have high probabilities themselves.
    - For lower orders, probabilities depend on the number of distinct single words that precede the n-gram (“single word contexts”), e.g. the size of the set \( \{w_j \: | \: c(w_j w_2 w_3) > 0\} \) for bi-grams.
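
The intuition above matches the textbook interpolated Kneser-Ney recursion, written out here for reference. The slides do not state the discount or the exact variant used in the app, so the single discount \( D \) (with \( 0 < D < 1 \)) and the notation \( N_{1+} \) (number of distinct words filling the position marked \( \bullet \)) are assumptions of this sketch.

\[ P_{KN}(w_3|w_1w_2) = \frac{\max(c(w_1 w_2 w_3) - D,\, 0)}{c(w_1 w_2)} + \frac{D \, N_{1+}(w_1 w_2 \bullet)}{c(w_1 w_2)} \, P_{KN}(w_3|w_2) \]

\[ P_{KN}(w_3|w_2) = \frac{\max(N_{1+}(\bullet\, w_2 w_3) - D,\, 0)}{N_{1+}(\bullet\, w_2 \bullet)} + \frac{D \, N_{1+}(w_2 \bullet)}{N_{1+}(\bullet\, w_2 \bullet)} \, P_{KN}(w_3) \]

\[ P_{KN}(w_3) = \frac{N_{1+}(\bullet\, w_3)}{N_{1+}(\bullet\, \bullet)} \]

Here \( N_{1+}(\bullet\, w_2 w_3) = |\{w_j \: | \: c(w_j w_2 w_3) > 0\}| \) is the “single word context” count mentioned above: the first term rewards tri-grams seen often in the corpus, while the second passes the discounted probability mass down to the lower-order estimates.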

Main Challenges

Data size: The model is trained on 3.3 million texts (articles, blogs and tweets), but the Shiny server limits how much data can be hosted. A space-efficient structure was required.

Classes 'data.table' and 'data.frame':  3298835 obs. of  14 variables:
 $ order    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ w1       : chr  NA NA NA NA ...
 $ w2       : chr  NA NA NA NA ...
 $ w3       : chr  "0-5years" "0-for-the" "0.03mm" "0.28mm" ...
 $ std.count: int  2 2 2 6 5 2 7 3 2 61 ...
 $ cont11   : int  1 0 1 1 1 0 1 0 1 3 ...
 $ cont12   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cont22   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cont32   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cont33   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ bg_count : int  NA NA NA NA NA NA NA NA NA NA ...
 $ Pkn_1    : num  7.75e-07 0.00 7.75e-07 7.75e-07 7.75e-07 ...
 $ Pkn_2    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Pkn_3    : num  NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "sorted")= chr  "order" "w2" "w1" "w3"

Response time: A few tenths of a second at most. This was achieved by pre-calculating all probabilities and storing them in a keyed data.table. The application itself only performs lookups.
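
The lookup-only design can be illustrated with a minimal sketch. This is Python rather than the app's own R code: a plain dictionary stands in for the keyed data.table, the names `build_index` and `predict` are hypothetical, and the probabilities are invented for the example.

```python
import heapq
from collections import defaultdict


def build_index(rows):
    """Group precomputed rows (w1, w2, w3, probability) by their two-word context."""
    index = defaultdict(dict)
    for w1, w2, w3, prob in rows:
        index[(w1, w2)][w3] = prob
    return index


def predict(index, text, k=3):
    """Return up to k candidate next words for the last two words of `text`."""
    words = text.lower().split()
    if len(words) < 2:
        return []  # this sketch does not handle shorter inputs
    candidates = index.get((words[-2], words[-1]), {})
    # No probabilities are computed here: the work is one lookup plus a top-k selection.
    return heapq.nlargest(k, candidates, key=candidates.get)


# Invented probabilities, for illustration only.
rows = [
    ("president", "barack", "obama", 0.90),
    ("president", "barack", "said", 0.05),
    ("president", "barack", "and", 0.02),
    ("thank", "you", "for", 0.40),
    ("thank", "you", "so", 0.30),
]
index = build_index(rows)
print(predict(index, "The president Barack", k=2))  # ['obama', 'said']
```

Because every probability is stored ahead of time, the cost at typing time is a single keyed lookup followed by selecting the `k` best candidates, which is what keeps the response time low.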

Resulting Product

URL Link: https://lucfrachon-ds.shinyapps.io/NLP_NextWordPrediction/

Type text in the input box
The output box will suggest possible next words
Select a word from the box and click on the “Use selected word” button to add it to the text, or…
Continue typing and the app will suggest more words to complete your phrase.
Select the maximum number of suggested words using the slider.
Have fun!
