2025-07-07

Application

This app suggests the most likely next word for a given text.

The word prediction is based on a simple N-gram language model (1- to 3-grams), i.e. a machine learning model that assigns a probability to each possible next word.
Two probability estimation algorithms are implemented in the app (a sketch of both follows the list):

  • Interpolation: combines the trigram, bigram and unigram probabilities into a single estimate by linearly interpolating them
  • Stupid Backoff: if a higher-order n-gram has a zero count, the lower-order n-gram’s count is used instead (i.e. we “back off” to a lower-order n-gram only when we have zero evidence for the higher-order one)
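
To make the two schemes concrete, here is a minimal sketch in R. It assumes the n-gram counts are stored in named numeric vectors (the names being space-separated n-grams); the interpolation weights and the 0.4 back-off discount are illustrative values, not necessarily those used by the app.

    # Look up an n-gram count, returning 0 for unseen n-grams.
    count <- function(tbl, key) if (key %in% names(tbl)) tbl[[key]] else 0

    # Interpolation: P(w | w1 w2) = l3*P_tri + l2*P_bi + l1*P_uni,
    # with illustrative weights l that sum to 1.
    interp_prob <- function(w, w1, w2, uni, bi, tri, l = c(0.6, 0.3, 0.1)) {
      p3 <- count(tri, paste(w1, w2, w)) / max(count(bi, paste(w1, w2)), 1)
      p2 <- count(bi, paste(w2, w)) / max(count(uni, w2), 1)
      p1 <- count(uni, w) / sum(uni)
      l[1] * p3 + l[2] * p2 + l[3] * p1
    }

    # Stupid Backoff: fall back to a lower-order n-gram only when the
    # higher-order count is zero, scaling each back-off step by a fixed
    # discount (0.4 is the value suggested in the original paper).
    sb_score <- function(w, w1, w2, uni, bi, tri, alpha = 0.4) {
      c3 <- count(tri, paste(w1, w2, w))
      if (c3 > 0) return(c3 / count(bi, paste(w1, w2)))
      c2 <- count(bi, paste(w2, w))
      if (c2 > 0) return(alpha * c2 / count(uni, w2))
      alpha^2 * count(uni, w) / sum(uni)
    }

Note that Stupid Backoff returns relative scores rather than true probabilities, so candidate words are ranked by score rather than by a normalized probability.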

Technology

This is a Shiny app: it is written in R and built on the Shiny web application framework.

This Shiny app is available at the following URL: https://vladmag.shinyapps.io/NextWordPrediction_App/
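
Publishing to shinyapps.io (as in the URL above) is normally done with the rsconnect package; a hypothetical deployment call, with the app directory name assumed:

    # Deploy the app directory to shinyapps.io (directory name assumed).
    library(rsconnect)
    deployApp(appDir = "NextWordPrediction_App")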

Usage

The user interface is simple and intuitive (a minimal Shiny sketch of this layout follows the steps below).

  1. Enter your test phrase in the text box.

  2. Select the prediction algorithm and output parameters in the left sidebar.

  3. Press the “Predict Next Word” button; the predicted words will appear below as buttons.

  4. Click on a predicted word to append it to the entered text.

Please note that the predicted words are presented in their base form (lemma).
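
A minimal, hypothetical sketch of how this interface maps onto Shiny components (all input IDs and the predict_next() helper are illustrative, not the app’s actual code):

    library(shiny)

    ui <- fluidPage(
      sidebarLayout(
        sidebarPanel(
          radioButtons("algo", "Prediction algorithm",
                       c("Interpolation", "Stupid Backoff")),
          numericInput("n_words", "Number of suggestions", 3, min = 1, max = 30)
        ),
        mainPanel(
          textAreaInput("phrase", "Test phrase", ""),
          actionButton("go", "Predict Next Word"),
          uiOutput("suggestions")   # predicted words rendered as buttons
        )
      )
    )

    server <- function(input, output, session) {
      # Run the model only when the button is pressed.
      preds <- eventReactive(input$go, {
        predict_next(input$phrase, input$algo, input$n_words)  # assumed helper
      })
      # Render each predicted word as a clickable button; the real app also
      # observes these buttons and appends the clicked word to the text.
      output$suggestions <- renderUI({
        lapply(preds(), function(w) actionButton(paste0("word_", w), w))
      })
    }

    shinyApp(ui, server)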

Training Data

The language model (LM) of this application is trained on a corpus of English-language texts from news, blogs and Twitter (4’269’678 documents in total).

Data cleaning and transformations

Before generating tokens and building document-feature matrices (DFMs), the texts in the corpus were cleaned and transformed: UTF-8 to Latin transliteration, conversion of contractions, sentence extraction and noise removal (titles, times, abbreviations, emails, URLs, hashtags/Twitter handles, numbers and punctuation marks).
The corpus was then tokenized (for unigrams), and interjection and profanity tokens were removed, leaving empty “pads” in their place so that n-grams never span a removed word. The remaining tokens were lemmatized (replaced with their lemma) and low-frequency tokens were removed. Finally, bigrams and trigrams were generated (see the sketch below).
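
The DFM/tokens terminology suggests the quanteda ecosystem; the sketch below follows that assumption (corp and the bad_words profanity vector are assumed objects, lemmatization uses the lexicon package’s lemma table, and the frequency threshold is assumed; the app’s exact pipeline may differ):

    library(quanteda)

    toks <- tokens(corp,                     # corp: the cleaned corpus
                   remove_punct   = TRUE,
                   remove_numbers = TRUE,
                   remove_url     = TRUE,
                   remove_symbols = TRUE) |>
      tokens_tolower() |>
      # Remove profanity/interjections but keep empty "pads" so that
      # n-grams are never formed across a removed token.
      tokens_remove(pattern = bad_words, padding = TRUE) |>
      # Lemmatize: replace each token with its lemma.
      tokens_replace(pattern     = lexicon::hash_lemmas$token,
                     replacement = lexicon::hash_lemmas$lemma)

    # Unigram DFM with low-frequency tokens dropped (threshold assumed).
    dfm1 <- dfm_trim(dfm(toks), min_termfreq = 2)

    # Bigram and trigram DFMs.
    dfm2 <- dfm(tokens_ngrams(toks, n = 2, concatenator = " "))
    dfm3 <- dfm(tokens_ngrams(toks, n = 3, concatenator = " "))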

LM vocabulary and N-grams

The resulting vocabulary contains 42’842 words (149’588 words before cleaning and transformations).
Number of bigrams (with frequency > 1 in the DFM): 552’399
Number of trigrams (with frequency > 1 in the DFM): 715’392
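
Under the same quanteda assumption, such counts can be read off the document-feature matrices with nfeat() (object names follow the sketch above):

    nfeat(dfm1)                               # vocabulary size (unigrams)
    nfeat(dfm_trim(dfm2, min_termfreq = 2))   # bigrams with frequency > 1
    nfeat(dfm_trim(dfm3, min_termfreq = 2))   # trigrams with frequency > 1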

Performance

Model performance was evaluated by holdout validation, which provides an estimate of how well the model generalizes to new, unseen data.
At the very beginning, the corpus was split into two parts: training data and test data (the held-out dataset). The latter was not used in training the language model (a sketch of such a split follows).
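
A sketch of such a split, again under the quanteda assumption (the 90/10 proportion is illustrative; the actual ratio is not stated):

    # Hold out a random 10% of documents before any model training.
    set.seed(1)                          # for a reproducible split
    n     <- ndoc(corp)                  # corp: the full corpus
    idx   <- sample(n, size = round(0.9 * n))
    train <- corp[idx]                   # used to build the language model
    test  <- corp[-idx]                  # held out, used only for evaluation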

The evaluation criteria are the following:

  • whether the actual word is predicted at all, i.e. appears among the top-30 predicted words

  • whether the language model assigns the highest probability to the actual word (prediction accuracy)

The following output shows the results of testing 1’000 trigrams (one sample per sentence from 1’000 held-out sentences; words #1 and #2 are the predictors, word #3 is the actual word to be predicted):

not predicted       1st top       2nd top  within top 5   within 6-30 
          640            81            52           226           134 

“1st top” is the number of cases in which the model assigned the highest probability to the actual word, i.e. a perfect match. Note that 640 + 226 + 134 = 1’000: the “within top 5” column subsumes the 1st and 2nd top counts, and “within 6-30” covers ranks 6 to 30.
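
For concreteness, here is a sketch of the per-case scoring behind this table, assuming a hypothetical predict_ranked() helper that returns up to 30 candidate words ordered by model probability. The “within top 5” flag subsumes the 1st and 2nd top cases, matching the reported counts.

    # Classify one held-out trigram; returns five (overlapping) indicator flags.
    score_case <- function(w1, w2, actual) {
      ranked <- predict_ranked(w1, w2)   # assumed helper: top-30 candidates
      pos <- match(actual, ranked)       # NA if the actual word is absent
      c(not_predicted = is.na(pos),
        top1          = isTRUE(pos == 1),
        top2          = isTRUE(pos == 2),
        within_top5   = isTRUE(pos <= 5),   # includes the 1st and 2nd top
        within_6_30   = isTRUE(pos > 5))
    }

    # Summing the flags over the 1'000 test trigrams yields the five counts:
    # rowSums(mapply(score_case, test$w1, test$w2, test$w3))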