4-gram next word prediction

Vincent Force
March 23rd, 2018

Executive summary

The purpose is to design an algorithm for predicting the next word based on (at most) the 3 preceding words, following the main steps below:

  • Data processing: long computations on the training data, performed at design time so that runtime performance is optimal (good accuracy with low memory and CPU usage),
  • Evaluation and iteration: refining the algorithm and measuring its accuracy at each step on a fixed test set,
  • Application: designing a shiny application to demonstrate the algorithm's results.

The resulting algorithm has to use less than 1 GB of RAM at runtime to fit within the shinyapps.io free-plan limit, and must return predictions quickly enough to be used interactively.

We have to keep in mind that the initial target is a smartphone!

Data processing

Data processing involves many steps, including statistical computations as well as NLP-specific operations and optimisations:

  • Vocabulary: based on 1-grams with a frequency greater than 1 (see the processing sketch after this list)
  • Profanity: profane words are excluded from predictions, even if they appear in the training set
  • Cleaning: sentences containing unwanted tokens (symbols, numbers, URLs, hashtags, …) are excluded from the training set
  • Sentence boundaries: special words BEGIN and END are added at the start and end of each sentence
  • Unknown words: a special word UNK replaces out-of-vocabulary words
  • Computation of Maximum Likelihood estimates, Good-Turing smoothing, Katz discounting and backoff factors
  • Pruning: the lowest-frequency items are deleted, as well as those predicting UNK or BEGIN
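Below is a minimal sketch of the cleaning, vocabulary and n-gram counting steps, assuming quanteda is used throughout; the object names (raw_text, vocab, ngram_counts) and the toy input are illustrative, not the actual project code.

    library(quanteda)

    # Toy input; the real model is trained on a sample of the corpus
    raw_text <- c("I am going to the beach today.",
                  "Check http://example.com #fun 123",
                  "See you at the beach tomorrow.")

    # Split into sentences and drop those containing URLs, hashtags or digits
    sentences <- as.character(corpus_reshape(corpus(raw_text), to = "sentences"))
    sentences <- char_tolower(sentences[!grepl("http\\S+|#\\S+|[0-9]", sentences)])

    # Add sentence boundary markers, then tokenise without punctuation/symbols
    toks <- tokens(paste("BEGIN", sentences, "END"),
                   remove_punct = TRUE, remove_symbols = TRUE)

    # Vocabulary: 1-grams with frequency greater than 1
    dfm1  <- dfm(toks, tolower = FALSE)
    vocab <- featnames(dfm_trim(dfm1, min_termfreq = 2))

    # Replace out-of-vocabulary words with UNK
    oov  <- setdiff(featnames(dfm1), vocab)
    toks <- tokens_replace(toks, pattern = oov,
                           replacement = rep("UNK", length(oov)))

    # Count 1- to 4-grams; these tables feed the smoothing, discounting
    # and pruning steps (profanity, UNK and BEGIN predictions are removed later)
    ngram_counts <- setNames(
      lapply(1:4, function(n)
        colSums(dfm(tokens_ngrams(toks, n = n, concatenator = " "),
                    tolower = FALSE))),
      1:4)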

Prediction(s)

We tried the stemming methods provided by the quanteda package (e.g. going, gone and go all get stemmed to go) to increase the number of possible matches, as in the short example below.
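For illustration, stemming can be applied to a tokens object with quanteda's tokens_wordstem(), which wraps the Snowball stemmer:

    library(quanteda)

    # Reduce inflected forms to a common stem before matching n-grams
    toks <- tokens("she was going to go where he had gone")
    tokens_wordstem(toks)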

Four different backoff prediction algorithms have been implemented in order to benchmark them against each other.

  1. Stupid backoff, using the Maximum Likelihood Estimate as the N-gram probability and a constant backoff factor (a minimal sketch follows this list)
  2. Same as above, but with stemming
  3. Katz backoff, using Good-Turing smoothing and a different backoff factor for each N-gram
  4. Same as above, but with stemming
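A minimal sketch of the stupid backoff lookup (algorithm 1), assuming the ngram_counts tables from the data-processing sketch above; predict_backoff and its defaults are illustrative, not the project code. The Katz variant would replace the raw maximum-likelihood ratios with Good-Turing discounted probabilities and a backoff factor computed per context.

    # history: the phrase typed so far; returns the `top` best-scoring next words.
    # A constant backoff factor (0.4 here) down-weights lower-order matches.
    predict_backoff <- function(history, ngram_counts, lambda = 0.4, top = 3) {
      words  <- tail(strsplit(tolower(history), "\\s+")[[1]], 3)
      scores <- numeric(0)
      for (order in seq(length(words), 0)) {          # longest context first
        context <- paste(tail(words, order), collapse = " ")
        counts  <- ngram_counts[[as.character(order + 1)]]
        if (order > 0) {
          hit  <- startsWith(names(counts), paste0(context, " "))
          cand <- sub("^.* ", "", names(counts)[hit]) # last word of the n-gram
        } else {
          hit  <- rep(TRUE, length(counts))           # fall back to unigrams
          cand <- names(counts)
        }
        if (any(hit)) {
          # Normalising by the total count of continuations approximates the MLE;
          # profanity, UNK and BEGIN candidates would be filtered out here.
          s <- lambda^(length(words) - order) * counts[hit] / sum(counts[hit])
          scores <- c(scores, setNames(s, cand))
        }
      }
      if (length(scores) == 0) return(scores)
      # keep the best score per candidate word and return the top suggestions
      head(sort(tapply(scores, names(scores), max), decreasing = TRUE), top)
    }

    predict_backoff("see you at the", ngram_counts)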

Accuracy

Accuracy is given for different training sizes (percentage of the initial data used to train the model)

  • small for 0.1% / 3 MB RAM at runtime,
  • medium for 1% / 20 MB RAM at runtime,
  • big for 10% / 200 MB RAM at runtime.

and for different accuracy levels (a sketch of the measurement follows this list):

  • 1 for the proportion of words to guess found as the first prediction
  • 3 for the proportion of words to guess found among the first 3 predictions
  • 10 for the proportion of words to guess found among the first 10 predictions
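A sketch of how these three levels can be measured on a held-out test set, assuming the illustrative predict_backoff function above; top_k_accuracy and test_sentences are hypothetical names.

    # For each word position in the test sentences, ask the model for its top 10
    # suggestions and record whether the true word appears in the top 1, 3 or 10.
    top_k_accuracy <- function(test_sentences, ngram_counts, ks = c(1, 3, 10)) {
      hits  <- setNames(numeric(length(ks)), ks)
      total <- 0
      for (s in test_sentences) {
        w <- strsplit(tolower(s), "\\s+")[[1]]
        for (i in seq_along(w)[-1]) {
          history <- paste(w[seq_len(i - 1)], collapse = " ")
          preds   <- names(predict_backoff(history, ngram_counts, top = max(ks)))
          rank    <- match(w[i], preds)
          total   <- total + 1
          if (!is.na(rank)) hits[ks >= rank] <- hits[ks >= rank] + 1
        }
      }
      hits / total
    }

    top_k_accuracy(c("see you at the beach"), ngram_counts)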

[Figure: prediction accuracy by training size (small, medium, big) and accuracy level (1, 3, 10)]

Shiny Application

    A responsive application has been designed for
    you to test the algorithms.

    It provides an interactive next word prediction
    interface and an accuracy measurement tool for
    text provided by the user.

    Click to go to the shiny application

Interactive next word prediction interface
Type an input phrase and hit the space bar; the interface then provides (a minimal UI sketch follows this list):

  • Suggestions for the next word, with buttons to append a suggestion to the typed text
  • 3D plots showing the discounted probabilities of the n-grams involved in the prediction, by level
  • A word cloud of predictions, all levels combined
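As a hypothetical illustration only (not the actual application code), a minimal shiny wiring of the suggestion buttons could look like the sketch below, reusing the illustrative predict_backoff function and ngram_counts tables from above; the 3D plots and word cloud are omitted.

    library(shiny)

    ui <- fluidPage(
      textInput("phrase", "Type a phrase:", width = "100%"),
      uiOutput("suggestions")   # one button per suggested next word
    )

    server <- function(input, output, session) {
      suggest <- reactive({
        req(input$phrase)
        names(predict_backoff(input$phrase, ngram_counts, top = 3))
      })
      output$suggestions <- renderUI({
        lapply(seq_along(suggest()), function(i)
          actionButton(paste0("sug", i), label = suggest()[i]))
      })
      # Clicking a suggestion appends it to the typed text
      lapply(1:3, function(i)
        observeEvent(input[[paste0("sug", i)]], {
          updateTextInput(session, "phrase",
                          value = paste(input$phrase, suggest()[i]))
        }))
    }

    shinyApp(ui, server)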