Predictive text algorithm

James R. Milks
July 14, 2021

Predictive text

Predictive text algorithms are ubiquitous on smartphones and tablets. They make typing on small keyboards much easier and may even affect how we use language (e.g. https://www.bbc.com/future/article/20190812-how-ai-powered-predictive-text-affects-your-brain).

The challenge given to me was to build a predictive text model from a subsample of text from Twitter, blogs, and news stories containing over 4 million lines of text.

The model

Details

Start typing in the sidebar and the app automatically gives three predictions for the next word.

  • Handles sentences of any length, from single words on up
  • Can easily be improved given additional training material
  • Can be repurposed to predict any other sequence or language
    • Just give it the appropriate training material

Methods

  • The original corpus came only from Twitter, blogs, and news stories and lacked longer-form prose such as novels, so I downloaded books from Project Gutenberg to round out the corpus
    • Added books written by Jane Austen, the Brontë sisters, H.G. Wells, Charles Dickens, Charles Darwin, and Sir Arthur Conan Doyle, along with a literature textbook.
  • Processed the data to remove punctuation and extraneous white space (see the cleaning sketch after this list).
  • Trained a 4-Gram Stupid Back-off model on 99.9% of the total data (over 5 million lines of text); a scoring sketch also follows this list.
  • Advantages:
    • Faster calculations and model training
    • Depends on relative frequencies instead of probabilities, so normalization is not needed
    • May be adapted to wider uses, predicting any kind of sequence
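
The cleaning step can be illustrated with a short sketch. This is not the project's code, just a minimal example of the kind of preprocessing described above: lower-casing each line, stripping punctuation (keeping apostrophes for contractions, which is my own assumption), collapsing extra white space, and splitting into word tokens.

    import re

    def clean_line(line: str) -> list[str]:
        """Lower-case a line, strip punctuation, collapse extra white
        space, and return the resulting word tokens."""
        line = line.lower()
        line = re.sub(r"[^a-z' ]+", " ", line)    # keep only letters, apostrophes, and spaces
        line = re.sub(r"\s+", " ", line).strip()  # collapse runs of white space
        return line.split()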
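
The 4-Gram Stupid Back-off scoring can also be sketched. The snippet below is a minimal, in-memory illustration rather than the trained model itself: it assumes a conventional back-off factor of 0.4, counts 1- through 4-grams with a plain Counter, and reuses the clean_line helper from the previous sketch. The function names (count_ngrams, score, predict_top3) are hypothetical.

    from collections import Counter

    ALPHA = 0.4  # conventional Stupid Back-off discount per back-off step

    def count_ngrams(token_lines, max_n=4):
        """Count every 1- to max_n-gram across the tokenised corpus."""
        counts = Counter()
        for tokens in token_lines:
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    counts[tuple(tokens[i:i + n])] += 1
        return counts

    def score(word, context, counts, total_words):
        """Stupid Back-off score: relative frequency under the longest
        observed context, discounted by ALPHA for each back-off step."""
        for k in range(len(context), 0, -1):
            ctx = tuple(context[-k:])
            ngram = ctx + (word,)
            if counts[ngram] > 0:
                return ALPHA ** (len(context) - k) * counts[ngram] / counts[ctx]
        # back off all the way to the unigram relative frequency
        return ALPHA ** len(context) * counts[(word,)] / total_words

    def predict_top3(phrase, counts, vocabulary, total_words):
        """Return the three highest-scoring candidates for the next word."""
        context = clean_line(phrase)[-3:]  # a 4-Gram model uses up to 3 words of context
        return sorted(vocabulary,
                      key=lambda w: score(w, context, counts, total_words),
                      reverse=True)[:3]

With counts built from the training lines, a vocabulary taken from the unigram keys, and total_words equal to the sum of the unigram counts, a call such as predict_top3("thanks for the", counts, vocabulary, total_words) would return the model's three best guesses for the next word. Because the scores are relative frequencies rather than probabilities, no normalization step is needed.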

Accuracy

Evaluated the accuracy of the model using a held-out test data set consisting of 0.1% of the total data.

  • Accuracy was defined as the proportion of N-Grams in the test data for which one of the three predicted words was the correct next word (a sketch of this evaluation follows the list)
  • 32.6% (± 0.4%) of the model predictions were correct
    • Roughly a 60% improvement over the accuracy of a 3-Gram model
    • Using higher-order N-Grams did not significantly improve accuracy
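
A minimal sketch of this evaluation, under the assumption that every 4-gram in the held-out lines is tested (the first three words as context, the fourth as the answer), might look like the following. It reuses predict_top3 from the Methods sketch, and the function name top3_accuracy is hypothetical.

    def top3_accuracy(test_token_lines, counts, vocabulary, total_words):
        """Fraction of test 4-grams whose actual next word appears among
        the model's three predictions for the preceding three words."""
        hits = trials = 0
        for tokens in test_token_lines:
            for i in range(len(tokens) - 3):
                context, actual = tokens[i:i + 3], tokens[i + 3]
                candidates = predict_top3(" ".join(context), counts,
                                          vocabulary, total_words)
                hits += actual in candidates
                trials += 1
        return hits / trials if trials else 0.0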