Ngram-based word prediction

Marek Ostaszewski
9/12/2017

Data Science course, capstone project

Introduction

Source data - summary

The problem

Users need accurate systems that suggest the next word as they type.
This greatly speeds up typing, especially on mobile devices.

Source data

  • Three text corpora (blogs, news and twitter) are used for predictions
  • Initial exploratory analysis (see here) shows:
    • differences in content between the blogs, news and twitter datasets
    • long tails of the word and ngram distributions

Ngram corpora

  • Predictions are made using ngrams - sequences of n words, where the first (n-1) words form the context for the n-th, predicted word
  • Corpora are cleaned (see here) and pruned to construct ngrams (left panel); a minimal sketch of this step follows the list
  • The vocabulary is reduced to the 30k most frequent words: out-of-vocabulary words in the ngrams and in predicted phrases become wildcards
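
A minimal sketch of the construction step, assuming the cleaned corpus is already available as a character vector of sentences; the function name build_ngrams and the <UNK> wildcard token are illustrative, not the project's actual code.

    # Illustrative only: build ngram frequency tables from cleaned sentences,
    # keeping the 30k most frequent words and mapping the rest to a wildcard.
    build_ngrams <- function(sentences, n = 4, vocab_size = 30000) {
      tokens <- strsplit(tolower(sentences), "\\s+")
      freq   <- sort(table(unlist(tokens)), decreasing = TRUE)
      vocab  <- names(freq)[seq_len(min(vocab_size, length(freq)))]
      tokens <- lapply(tokens, function(w) ifelse(w %in% vocab, w, "<UNK>"))
      # slide a window of length n over each sentence to collect ngrams
      ngrams <- unlist(lapply(tokens, function(w) {
        if (length(w) < n) return(character(0))
        vapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "), character(1))
      }))
      sort(table(ngrams), decreasing = TRUE)   # ngram frequency table
    }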

Models for the next word prediction

Three models were considered for next word prediction using the constructed ngram corpora:

  • Katz backoff model (see the Wikipedia entry and an excellent guide by Michael Szczepaniak)
    In essence: estimate the discounted probability of a highest order ngram (i.e. an ngram with the longest word sequence). Then back off recursively to shorter ngrams, modifying their estimated probability by the redistributed probability mass left from discounting.
  • Stupid backoff model (see the materials by prof. Jurafsky); a sketch of this approach follows the list
    In essence: estimate the probability of a highest order ngram (i.e. an ngram with the longest word sequence). Then back off recursively to shorter ngrams, modifying their estimated probability by a fixed parameter λ.
  • Kneser-Ney smoothing (see the materials by prof. Jurafsky, blog of Smitha Milli and the report of Phil Ferriere)
    In essence: estimate the discounted probability of a highest order ngram (i.e. an ngram with the longest word sequence). Then back off recursively to shorter ngrams to calculate a continuation probability, i.e. how likely the predicted word is to appear in the given context.
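
As an illustration of the backoff idea, here is a minimal sketch of stupid backoff over precomputed frequency tables. The data layout (counts, a list of named frequency vectors per ngram order) and λ = 0.4 are assumptions made for this example, not the project's actual implementation.

    # Illustrative stupid backoff scoring; counts[[k]] is assumed to be a
    # named vector of k-gram frequencies, e.g. counts[[2]]["of the"] = 12345.
    stupid_backoff <- function(word, context, counts, lambda = 0.4) {
      k <- length(context) + 1                 # highest available ngram order
      while (k >= 2) {
        prefix <- paste(tail(context, k - 1), collapse = " ")
        ngram  <- paste(c(tail(context, k - 1), word), collapse = " ")
        if (!is.na(counts[[k]][ngram]) && !is.na(counts[[k - 1]][prefix])) {
          # penalize by lambda once per backoff step already taken
          return(lambda^(length(context) + 1 - k) *
                 unname(counts[[k]][ngram] / counts[[k - 1]][prefix]))
        }
        k <- k - 1                             # back off to a shorter context
      }
      # unigram fallback: relative frequency of the word itself
      lambda^length(context) * unname(counts[[1]][word]) / sum(counts[[1]])
    }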

Optimization

The following steps were taken to ensure the algorithm runs smoothly:

  • Discretization of ngram data - each word is represented by an integer, and words are stored in a single lookup table (see the sketch after this list).
    1) Faster comparisons ('==' operator instead of grep)
    2) Less memory taken by arrays of integers
  • Pre-calculation of parameters - repeatedly re-calculated values (see below) were pre-calculated and stored in lookup tables.
    1) Bigram probabilities (they are a substantial portion of the ngram corpus)
    2) Continuation counts - for Kneser-Ney smoothing (frequent and very repetitive lookups of subgrams)
  • Scope reduction - to run within the Shiny app, the scope of the algorithms was reduced
    1) Katz backoff and Kneser-Ney smoothing are run only for 4-grams
    2) Only stupid backoff is run on the full corpus, further reduced to ngrams with a frequency of 5 or greater
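
A minimal sketch of the discretization idea: a single word lookup table plus ngrams stored as integer matrices, so matching reduces to integer comparison. The object and function names (lookup, encode, decode) are illustrative only.

    # Illustrative discretization: words map to integer ids and ngrams are
    # stored as integer matrices, so matching uses '==' on integers.
    words  <- c("the", "quick", "brown", "fox")      # toy vocabulary
    lookup <- setNames(seq_along(words), words)      # word -> integer id

    encode <- function(phrase) unname(lookup[strsplit(tolower(phrase), "\\s+")[[1]]])
    decode <- function(ids)    names(lookup)[ids]

    # a tiny 3-gram corpus as an integer matrix (columns: w1, w2, w3)
    trigrams <- rbind(encode("the quick brown"),
                      encode("quick brown fox"))

    # find trigrams whose first two words match the query context
    context <- encode("the quick")
    hits    <- trigrams[trigrams[, 1] == context[1] &
                        trigrams[, 2] == context[2], , drop = FALSE]
    decode(hits[, 3])                                # candidate next words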

With the optimizations above, the three corpora take 14.1 MB, 14.5 MB and 11 MB of disk space (39.6 MB total), and 142.6 MB, 141 MB and 110 MB of memory (393.6 MB total).

Performance evaluation

The accuracy of the algorithm was evaluated as follows:

  • From the remaining 50% of text a set of ngrams was sampled
    3896 from blogs, 4060 from news and 3815 from twitter
  • Each of the models, using each ngram corpus, was applied to this testing set, predicting the 5 most likely words
  • A combined approach was tested as well, where all of the methods' predictions were merged by ranking
  • If the actual next word was among the 5 predicted words, the prediction was considered correct (see the sketch below the table)
  • Accuracy of predictions using three ngram corpora on the three testing sets is shown in the table below
                      Test set: Blogs   Test set: News   Test set: Twitter
Ngram corpus: Blogs             0.276            0.243               0.245
Ngram corpus: News              0.247            0.299               0.222
Ngram corpus: Twitter           0.220            0.214               0.297
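
A minimal sketch of how the top-5 accuracy values above can be computed; predict_top5() is a hypothetical stand-in for any of the three models.

    # Illustrative top-5 accuracy over a test set of (context, true next word)
    # pairs; predict_top5() is a hypothetical placeholder assumed to return
    # the 5 most likely next words for a given context.
    top5_accuracy <- function(contexts, targets, predict_top5) {
      hits <- mapply(function(ctx, target) target %in% predict_top5(ctx),
                     contexts, targets)
      mean(hits)
    }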

In all cases the best performing algorithm was Kneser-Ney smoothing.

Context is an important factor for accuracy - ngram corpora constructed from a given type of source text perform best on their matching test set.

Runtime performance evaluation is available directly via the Shiny app (see next slide).

Shiny app description

Shiny app interface

The Shiny app with the prediction algorithm is available here.

  • Three constructed ngram corpora and three methods are available for prediction

  • Predictions are proposed while typing; three buttons allow the user to quickly add the most probable next words to the text field

  • The slider controls the number of predicted words

  • When the Profile runtime checkbox is ticked, Rprof() is run during prediction; its results are displayed for each query (see the sketch below)
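
When the checkbox is ticked, profiling wraps the prediction call roughly as in the sketch below; predict_words() is a hypothetical placeholder for the app's prediction function.

    # Illustrative profiling of one prediction query with Rprof();
    # predict_words() is a hypothetical placeholder, not the app's actual code.
    prof_file <- tempfile(fileext = ".out")
    Rprof(prof_file)                             # start the sampling profiler
    preds <- predict_words("to be or not")       # run a single prediction
    Rprof(NULL)                                  # stop profiling
    summaryRprof(prof_file)$by.total             # timings reported per query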