Word Prediction via Natural Language Processing

Andrew Luyt

Dec 11, 2022

Overview

This application uses machine learning techniques to predict the next word in a sentence. The user can type in a sentence themselves, click on predictions or load a partial sentence from a list of examples.

Basic algorithm

A large 5-gram look-up table was constructed containing the most common completion words, along with probabilities. Given an input, the table is searched and aggregated according to the selected smoothing algorithm and the best completions are returned.

Options

  • Load example sentences

  • Customize the number of predictions

  • Show word probabilities / scores

  • Omit common words like the or and.

  • Advanced: Choose the n-gram smoothing algorithm. Simple interpolation and Google’s stupid backoff are available.

Technical Details

We trained our model on 3.4 million samples of text from news articles, Twitter, and blogs, comprising over 80 million words. Our model uses 1- through 5-grams and can use two types of smoothing, simple interpolation and Google’s stupid backoff. The web interface uses R’s Shiny framework.

Performance

  • Simple interpolation

    • 270 predictions per second

    • 31.6% accuracy

  • Stupid backoff

    • 476 predictions per second.

    • 28.1% accuracy

  • Accuracy: We count a prediction as correct if the model predicts the correct word in its top four predictions, as this reflects how the app is used. The common perplexity metric was inappropriate here since stupid backoff produces scores, not true probabilities and we wanted to be able to compare the two smoothing algorithms.

Tuning for Speed & Accuracy

Most CPU time is spent looking up n-gram histories to find learned completions, so heavy use was made of hash tables with constant-time look-up. Changing to hash tables from a data.frame-based approach resulted in a speed-up of over 100x.

Parameters for simple interpolated smoothing were tuned with a grid search:

> best.results(results)
    l1    l2     l3      l4      l5 proportion.correct
1 0.39 0.549 0.0488 0.01098 0.00122          0.3165915
2 0.63 0.333 0.0259 0.00666 0.00444          0.3163361
3 0.79 0.189 0.0147 0.00567 0.00063          0.3163361

Grid search on a large validation corpus is slow, so custom plotting functions and a random forest model were used to explore this 5-dimensional parameter space and optimize areas where the grid search would be performed: