The Task

Text prediction algorithms are commonly used to suggest the next word while typing phrases or sentences. Two of the most familiar examples are phone texting apps and search autocomplete.

The goal of this project is to learn about natural language processing by creating a text prediction app that runs in a web browser. It needs to balance accuracy, speed, and storage size.

The Algorithm

This text prediction implementation uses the Katz Back-Off (KBO) algorithm for trigrams. A corpus of approximately 10 million words from actual news articles, blogs, and Twitter posts was used to train the algorithm.
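The frequency tables used in the examples below (freq.3 for trigrams, freq.2 for bigrams) are built from that corpus. They are not shown in this write-up, so here is a minimal sketch of how such tables could be constructed with data.table, assuming the cleaned corpus is available as a character vector of sentences named corpus.text (a name used only for illustration; the app's actual preprocessing may differ).

library(data.table)

# Build an n-gram frequency table from a character vector of sentences,
# joining words with "_" to match the feature format used below
# (e.g. "apples_and_oranges").
build.freq <- function(sentences, n) {
  words <- strsplit(tolower(sentences), "\\s+")
  grams <- unlist(lapply(words, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = "_"))
  }))
  dt <- data.table(feature = grams)
  dt[, .(frequency = .N), by = feature][order(-frequency)]
}

# corpus.text is an assumed character vector of corpus sentences
freq.2 <- build.freq(corpus.text, 2)
freq.3 <- build.freq(corpus.text, 3)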

To minimize storage and memory needs, only the last two words of a phrase were used to predict the next word. For example, suppose my phrase ends with “apples and”. The corpus contains the following trigrams beginning with “apples and”:

freq.3[grep("^apples_and", feature)]
              feature frequency
1: apples_and_oranges         4
2:   apples_and_pears         3

So “oranges” and “pears” would be the top two predicted words. On the other hand, the phrase “carrots and” does not appear in the corpus.

freq.3[grep("^carrots_and", feature)]
Empty data.table (0 rows) of 2 cols: feature,frequency

In that case, the algorithm backs off from the trigram model and instead uses known bigrams beginning with “and” to predict the next word:

freq.2[grep("^and_", feature)][order(-frequency)]
             feature frequency
   1:        and_the     12404
   2:          and_i      8295
   3:          and_a      5664
   4:       and_then      2825
   5:         and_it      2493
  ---                         
9474:        and_yup         3
9475:     and_yvonne         3
9476:        and_zoo         3
9477:    and_zooming         3
9478: and_zuckerberg         3
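Putting the two lookups together, the prediction step might look like the sketch below. The function name predict.next is hypothetical, and it scores candidates with plain relative frequencies rather than the discounted Katz estimates, so it illustrates only the back-off order (trigram first, then bigram), not the exact probabilities the app reports.

# Simplified back-off lookup: try trigrams that start with the last two
# words, and fall back to bigrams that start with the last word.
predict.next <- function(phrase, k = 3) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  last2 <- paste(tail(words, 2), collapse = "_")
  last1 <- tail(words, 1)

  hits <- freq.3[grep(paste0("^", last2, "_"), feature)]
  if (nrow(hits) == 0) {
    # Back off: no matching trigram, so use bigrams instead
    hits <- freq.2[grep(paste0("^", last1, "_"), feature)]
  }

  hits <- hits[order(-frequency)]
  hits[, score := frequency / sum(frequency)]  # relative frequency, not Katz
  head(hits[, .(word = sub(".*_", "", feature), score)], k)
}

predict.next("apples and")   # oranges, pears
predict.next("carrots and")  # backs off to bigrams: the, i, a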

The App

The app is freely available at https://rmshanley.shinyapps.io/textPrediction/

To use it, enter a string of text in the box and click Predict Next Word. The three words with the highest KBO probability, based on the training data, are displayed along with their probabilities.

Extra Features

End-of-sentence is a possible prediction; for now, only periods are predicted. If the entered text already ends with an end-of-sentence character (. ? !), the predictions are the most common sentence-starting words, such as I, the, and it.
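As a sketch of how that rule could work (reusing the predict.next sketch from above), the app could check the punctuation first and fall back to a table of sentence-starting word counts; start.words is a hypothetical table that would be collected from the corpus, not part of the code shown here.

# Hypothetical handling of sentence-final punctuation: if the input ends
# with . ? or !, suggest the most common sentence-starting words instead
# of going through the n-gram back-off.
ends.sentence <- function(phrase) {
  grepl("[.?!]\\s*$", phrase)
}

predict.with.eos <- function(phrase, k = 3) {
  if (ends.sentence(phrase)) {
    # start.words: assumed data.table of first-word-of-sentence counts,
    # with the same feature/frequency columns as freq.2 and freq.3
    return(head(start.words[order(-frequency)], k))
  }
  predict.next(phrase, k)
}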

Clicking Select beneath any of the predictions adds that word to the text box.


Future Development

Currently the app is very fast, but not very accurate. It may be possible to trade some speed for better predictions and still have a usable app. For example, a larger training corpus could be used (this would require more storage and computation time), or more than just the last two words could be used to predict the next word.

Thanks for trying it, and feedback is welcome!