Word Prediction Application

Nicolas Saunier
04/05/2020

Creating the prediction algorithm

Corpus: over 4 million lines of English text from Twitter, blogs and news
85% of the corpus used for training, 15% withheld for testing accuracy

Cleaning and tokenization

  • Profanity removal
  • Removal of smiley faces
  • Foreign characters
  • Numbers and abbreviations
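The cleaning steps above could look roughly like the following Python sketch. It is an illustration only, not the app's quanteda pipeline: the word lists, the emoticon pattern, and the choice to drop (rather than transliterate) words with foreign characters or digits are all assumptions made for the example.

```python
import re

# Hypothetical placeholder lists; real lists would be much longer.
PROFANITY = {"darn", "heck"}
ABBREVIATIONS = {"u": "you", "thx": "thanks"}

# Assumed pattern for simple text smileys such as :-) ;) =D :p (after lowercasing).
EMOTICON = re.compile(r"(?<!\w)[:;=]-?[()dp](?!\w)")


def clean_and_tokenize(line):
    """Lowercase a line, drop unwanted material, and return a list of word tokens."""
    text = EMOTICON.sub(" ", line.lower())
    tokens = []
    for word in text.split():
        # Assumption: skip any word containing non-ASCII characters or digits.
        if not word.isascii() or any(ch.isdigit() for ch in word):
            continue
        # Keep letters and inner apostrophes only (e.g. "don't").
        word = re.sub(r"[^a-z']", "", word).strip("'")
        # Expand abbreviations before filtering profanity.
        word = ABBREVIATIONS.get(word, word)
        if word and word not in PROFANITY:
            tokens.append(word)
    return tokens
```

Dropping whole words rather than individual characters avoids creating truncated tokens, at the cost of losing some legitimate words.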

N-grams up to 5-grams built, document-feature matrices created using quanteda
Frequencies summarized into data tables, with each n-gram split into input words and predicted word
Prediction ranks computed according to 3 different criteria
Trimming of single occurrences at each n-gram level and of low-ranked predictions
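As a rough illustration of these steps, the Python sketch below counts n-grams, splits each one into input words and a predicted word, trims single occurrences, and keeps only the top-ranked predictions per input. The function name, the parameters `min_count` and `keep_top`, and ranking by raw count are assumptions for the example; the app itself uses quanteda and data tables in R and ranks by 3 criteria that are not detailed here.

```python
from collections import Counter, defaultdict


def build_ngram_table(token_lists, max_n=5, min_count=2, keep_top=5):
    """Map each input context (tuple of words) to its ranked (word, count) predictions."""
    counts = Counter()
    for tokens in token_lists:
        # Start at bigrams: a unigram has no input words to predict from.
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1

    table = defaultdict(list)
    for ngram, count in counts.items():
        if count < min_count:
            continue  # trim n-grams seen only once
        # Split into input (all but last word) and prediction (last word).
        table[ngram[:-1]].append((ngram[-1], count))

    ranked = {}
    for context, candidates in table.items():
        # Highest count first; ties broken alphabetically for determinism.
        candidates.sort(key=lambda pair: (-pair[1], pair[0]))
        ranked[context] = candidates[:keep_top]  # trim low-ranked predictions
    return ranked
```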

Highly accurate predictions

Benchmark results are among the highest ever reported
benchmark_results

Internal benchmarking on the held-out test set shows even higher accuracy:

  • 20% prediction accuracy with the first predicted word
  • 36% prediction accuracy with the top five predicted words
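Top-1 and top-5 accuracy figures like these are typically computed as the share of held-out cases whose true next word appears among the first k predictions. The sketch below shows that calculation; `predict` stands for any hypothetical function that returns a ranked list of words for a context, and it is not the app's own benchmarking code.

```python
def top_k_accuracy(predict, test_cases, k):
    """Share of test cases whose true next word is among the first k predictions.

    predict: function taking a context and returning a ranked list of words.
    test_cases: list of (context, true_next_word) pairs from the held-out set.
    """
    if not test_cases:
        return 0.0
    hits = sum(1 for context, target in test_cases if target in predict(context)[:k])
    return hits / len(test_cases)
```

Calling it with `k=1` and `k=5` on the same test cases gives the two kinds of figure reported above.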

Further analysis showed that accuracy was very high for stopwords: 60% of stopwords were predicted by one of the app's top 5 predictions.
Because stopwords are easy to predict but carry little information, this led to the creation of the context-specific mode.

Innovative Features

Context-specific mode gives more interesting predictions using “pseudo tf-idf” weighting
Typing saver mode weights word probabilities by the number of characters to maximize the typing saved
Each mode uses a different backoff method
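One way to picture the three modes is as three scoring functions applied to the same candidate words. The Python sketch below is an assumption-laden illustration: the exact “pseudo tf-idf” formula is not given here, so the context-specific score uses the conditional probability multiplied by the negative log of the word's overall frequency as a stand-in, and the typing saver score is simply probability times word length.

```python
import math


def rank_predictions(candidates, mode, unigram_prob=None):
    """Rank (word, probability) candidates under one of three scoring modes.

    unigram_prob: overall frequency of each candidate word, values in (0, 1];
    required only for the "context_specific" mode.
    """
    def score(word, prob):
        if mode == "maximum_probability":
            return prob
        if mode == "typing_saver":
            # Expected number of characters saved if the prediction is accepted.
            return prob * len(word)
        if mode == "context_specific":
            # Assumed idf-like weight: words that are rare overall score higher.
            return prob * -math.log(unigram_prob[word])
        raise ValueError(f"unknown mode: {mode}")

    ranked = sorted(candidates, key=lambda pair: -score(*pair))
    return [word for word, _ in ranked]
```

With made-up probabilities, a frequent short word such as "the" ranks first by raw probability but drops below longer or rarer words in the other two modes.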

Example of different predictions for the phrase “when it comes to”

   prediction_rank maximum_probability context_specific  typing_saver
1:               1                 the         choosing relationships
2:               2                  my          cooking           the
3:               3                this       protecting       getting
4:               4                   a        assessing   immigration
5:               5                 our    relationships        making

More examples of the differences in outputs can be seen in the demo mode tab.

Screenshot & Links

Tabs enable you to read more about the model without leaving the app
Any type of input is accepted and converted to n-grams
Choose the number of predictions wanted with a slider
Choose the prediction mode with radio buttons
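A minimal Python sketch of how free-form input might be turned into lookup keys: the last few typed words form the longest context, and shorter contexts are tried until enough predictions are found. This is a plain longest-match backoff chosen for illustration, not one of the app's mode-specific backoff methods; `table` is a hypothetical dictionary mapping word tuples to ranked prediction lists.

```python
def predict_next(text, table, n_predictions=3, max_context=4):
    """Return up to n_predictions next words, backing off to shorter contexts."""
    tokens = text.lower().split()
    predictions = []
    # With 5-grams as the longest n-gram, at most 4 input words are useful.
    for size in range(min(max_context, len(tokens)), 0, -1):
        context = tuple(tokens[-size:])
        for word in table.get(context, []):
            if word not in predictions:
                predictions.append(word)
        if len(predictions) >= n_predictions:
            break  # enough predictions found, no need to back off further
    return predictions[:n_predictions]
```

The `n_predictions` argument plays the role of the slider: it caps how many words are returned.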

See the app here

app_screenshot