Nicolas Saunier
04/05/2020
Corpus: over 4 million lines of English text from Twitter, blogs, and news
85% of the corpus used for training, 15% withheld for accuracy testing
Cleaning and tokenization
N-grams up to 5-grams built and counted in document-feature matrices using quanteda
Frequencies summarized into data.table tables, each n-gram split into an input (the leading words) and a prediction (the final word)
Prediction ranks computed according to three different criteria, one per prediction mode (described below)
Single occurrences trimmed at each n-gram level, along with low-ranked predictions (a sketch of the pipeline follows)
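A minimal sketch of this pipeline in R, assuming a character vector `corpus_lines` holding the training text; the exact cleaning options, column names, and the rank cutoff are illustrative, not necessarily the app's:

```r
library(quanteda)
library(data.table)

# Cleaning and tokenization
toks <- tokens(corpus_lines, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE, remove_url = TRUE)
toks <- tokens_tolower(toks)

# N-grams up to 5-grams, counted via a document-feature matrix
ngrams <- tokens_ngrams(toks, n = 1:5, concatenator = " ")
counts <- colSums(dfm(ngrams))

# Summarize frequencies in a data.table, splitting each n-gram into
# input (all words but the last) and prediction (the last word)
dt <- data.table(ngram = names(counts), count = as.integer(counts))
dt[, n          := lengths(strsplit(ngram, " ", fixed = TRUE))]
dt[, input      := ifelse(n == 1, "", sub(" [^ ]+$", "", ngram))]
dt[, prediction := sub("^.* ", "", ngram)]

# Trim single occurrences at each level, rank what remains, and keep
# only the top-ranked predictions per input (cutoff illustrative)
dt <- dt[count > 1]
setorder(dt, input, -count)
dt[, prediction_rank := seq_len(.N), by = input]
dt <- dt[prediction_rank <= 5]
```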
Benchmark results are among the highest reported for this task
Internal benchmarking on the held-out test set shows even higher accuracy.
Further analysis showed that accuracy was very high for stopwords: 60% of stopwords were matched by one of the app's top 5 predictions.
This led to the creation of the context-specific mode.
Context-specific mode gives more informative suggestions by down-weighting ubiquitous words with a "pseudo tf-idf" weighting
Typing-saver mode weights word probabilities by word length to maximize the typing time saved
Each mode uses a different backoff method (a sketch of the scoring and backoff follows)
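A hedged sketch of how the three scoring criteria and a backoff lookup could work, building on the `dt` table above. The weighting formulas are illustrative stand-ins for the app's exact ones, and a stupid-backoff-style lookup is shown in place of the app's per-mode backoff methods:

```r
library(data.table)

# Unigram counts serve as the "document frequency" analogue in the
# pseudo tf-idf weighting (illustrative formula, not the app's exact one)
unigram   <- dt[input == "", .(prediction, uni_count = count)]
total_uni <- unigram[, sum(uni_count)]

scored <- merge(dt[input != ""], unigram, by = "prediction")

# 1. Maximum probability: conditional frequency of the prediction
scored[, p_max := count / sum(count), by = input]

# 2. Context-specific: damp words that are frequent everywhere (stopwords)
scored[, p_context := p_max * log(total_uni / uni_count)]

# 3. Typing saver: favor long words, which save the most keystrokes
scored[, p_typing := p_max * nchar(prediction)]

# Backoff: try the longest matching context first, then shorter ones;
# each mode ranks candidates by its own score column
predict_next <- function(phrase, score_col = "p_max", k = 5) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  for (len in rev(seq_len(min(4, length(words))))) {
    ctx  <- paste(tail(words, len), collapse = " ")
    hits <- scored[input == ctx]
    if (nrow(hits) > 0)
      return(head(hits[order(-get(score_col)), prediction], k))
  }
  head(unigram[order(-uni_count), prediction], k)  # last resort: unigrams
}
```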
Example of the different predictions for the phrase "when it comes to":

prediction_rank   maximum_probability   context_specific   typing_saver
1                 the                   choosing           relationships
2                 my                    cooking            the
3                 this                  protecting         getting
4                 a                     assessing          immigration
5                 our                   relationships      making
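Assuming the `predict_next` sketch above, columns like these could be assembled by querying each mode for the same phrase:

```r
# One column per mode; score column names follow the earlier sketch
sapply(c(maximum_probability = "p_max",
         context_specific    = "p_context",
         typing_saver        = "p_typing"),
       function(m) predict_next("when it comes to", score_col = m, k = 5))
```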
More examples of the differences in outputs can be seen in the demo mode tab.
Tabs enable you to read more about the model without leaving the app
Any type of input is accepted and converted to n-grams
Choose the desired number of predictions with a slider
Choose the prediction mode with radio buttons (a minimal UI sketch follows)
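A minimal Shiny sketch of the controls described above, reusing the `predict_next` sketch; tab titles, input IDs, and labels are illustrative, not the app's actual ones:

```r
library(shiny)

ui <- navbarPage("Next-Word Prediction",
  tabPanel("Predict",
    textInput("phrase", "Type a phrase:", placeholder = "when it comes to"),
    sliderInput("n_pred", "Number of predictions", min = 1, max = 10, value = 5),
    radioButtons("mode", "Prediction mode",
                 choices = c("Maximum probability" = "p_max",
                             "Context-specific"    = "p_context",
                             "Typing saver"        = "p_typing")),
    tableOutput("predictions")
  ),
  tabPanel("Demo mode"),      # side-by-side outputs of the three modes
  tabPanel("About the model") # documentation without leaving the app
)

server <- function(input, output, session) {
  output$predictions <- renderTable({
    data.frame(prediction = predict_next(input$phrase,
                                         score_col = input$mode,
                                         k = input$n_pred))
  })
}

shinyApp(ui, server)
```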
See the app here