2024-08-30

Introduction

  • The N-Gram Text Predictor is an R Shiny app that predicts the next word based on a line of text entered by the user

    • Similar to the predictive text feature seen in Google search
  • Users are prompted to enter a line of text. Once text is entered, the application immediately outputs the predicted next word

  • Developed using a training corpus of text drawn in equal parts from online blogs, news articles, and Twitter posts

    • 150,000 lines of text
    • 4.4 million instances of 112,000 distinct words/tokens
  • Available for free at https://kristopherhuffman.shinyapps.io/N-gram-Text-Predictor/

The Prediction Algorithm

  • A simple n-gram frequency model is used for prediction. First, n-gram frequencies (for n = 1, 2, 3, and 4) are calculated from the training corpus. Each line of text is treated as a separate ‘document’, so no n-gram spans two lines of text. With the n-gram frequency tables built before the application is deployed, the algorithm proceeds as follows:

  • If the input text string has at least 3 words, search for all 4-grams that begin with the last 3 user-entered words (the base). Among the matched 4-grams, take the fourth word of the most frequent 4-gram to be the predicted next word.

  • If the input text string has 2 words, search for all 3-grams that begin with the 2 user-entered words. Among the matched 3-grams, take the third word of the most frequent 3-gram to be the predicted next word.

  • If the input text string has only 1 word, search for all 2-grams that begin with that word. Among the matched 2-grams, take the second word of the most frequent 2-gram to be the predicted next word.
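
The lookup steps above can be sketched in a few lines. This is a minimal Python illustration, not the app's R code: the function names `build_ngram_counts` and `predict_exact`, the whitespace tokenizer, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter

def build_ngram_counts(lines, max_n=4):
    """Count n-grams for n = 1..max_n; no n-gram spans two lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = line.lower().split()  # naive whitespace tokenizer (assumption)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def predict_exact(counts, text):
    """Return the last word of the most frequent n-gram that begins with the base."""
    words = text.lower().split()
    base = tuple(words[-3:])  # last 3 words, or fewer if the input is shorter
    if not base:
        return None
    n = len(base) + 1
    matches = {gram: c for gram, c in counts[n].items() if gram[:-1] == base}
    if not matches:
        return None  # this sketch does not back off
    best = max(matches, key=matches.get)
    return best[-1]

# Hypothetical toy corpus; each string plays the role of one line of text.
toy_corpus = [
    "thanks for the follow",
    "thanks for the follow",
    "thanks for the help",
    "for the record",
]
counts = build_ngram_counts(toy_corpus)
print(predict_exact(counts, "thanks for the"))  # prints "follow"
```

Building the tables once, ahead of time, keeps each prediction to a table lookup, which is consistent with the pre-deployment construction described above.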

The Prediction Algorithm

  • A simple back-off technique is implemented to handle cases where a user might enter a combination of words that does not appear in the training corpus.

  • Accordingly, if no n-gram matches the base of n-1 words, the first word of the base is removed and the search continues among the (n-1)-grams using the new base of n-2 words. This back-off process of removing words and searching lower-order n-grams continues until a match is found or no words remain in the base. If no match is found at any order, the prediction algorithm returns a prediction of ‘unknown’.

  • Note that ties for the most frequent n-gram are possible. When ties occur, the prediction algorithm outputs each predicted word from each tied n-gram.

  • TL;DR: The N-Gram Text Predictor uses simple n-gram frequencies to predict the word most likely to follow the previous 3 or fewer words.
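
The back-off and tie handling described on this slide can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the app's R code: `build_ngram_counts`, `predict_with_backoff`, and the toy corpus are hypothetical names and data, and the sketch assumes back-off stops at 2-grams, since the slide says a failed search returns ‘unknown’.

```python
from collections import Counter

def build_ngram_counts(lines, max_n=4):
    """Count n-grams for n = 1..max_n; no n-gram spans two lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = line.lower().split()  # naive whitespace tokenizer (assumption)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def predict_with_backoff(counts, text, max_n=4):
    """Return every word tied for most frequent, backing off to shorter bases."""
    words = text.lower().split()
    base = tuple(words[-(max_n - 1):])  # at most the last max_n - 1 words
    while base:
        n = len(base) + 1
        # Map each candidate next word to the count of its n-gram.
        matches = {g[-1]: c for g, c in counts[n].items() if g[:-1] == base}
        if matches:
            top = max(matches.values())
            return sorted(w for w, c in matches.items() if c == top)
        base = base[1:]  # drop the first word and try the lower-order table
    return ["unknown"]  # assumed: no fallback to unigram frequencies

# Hypothetical toy corpus; each string plays the role of one line of text.
toy_corpus = [
    "see you later",
    "see you soon",
    "thank you very much",
    "thank you very much",
]
counts = build_ngram_counts(toy_corpus)
print(predict_with_backoff(counts, "hope to see you"))  # prints ['later', 'soon']
print(predict_with_backoff(counts, "xylophone"))        # prints ['unknown']
```

In the first call no 4-gram begins with "to see you", so the search falls back to 3-grams beginning with "see you", where two candidates tie and both are returned.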

Functionality

  • The N-Gram Text Predictor has no ‘Submit’ button. Predictions are reactive and update as the user types.

  • In the case of ties, the N-Gram Text Predictor outputs a wordcloud containing all tied words.

  • If the N-Gram Text Predictor has no prediction, it outputs ‘unknown’.

  • Users may include or omit punctuation and capitalization as they see fit.
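
One way to make punctuation and capitalization optional is to normalize the input before lookup. The Python sketch below is an assumption about how such a step might work, not a description of the app's actual preprocessing; the function name `normalize` and the choice to keep apostrophes are hypothetical.

```python
import re

def normalize(text):
    """Lowercase the input and strip punctuation, keeping apostrophes."""
    text = text.lower()
    # Replace anything that is not a letter, digit, apostrophe, or whitespace.
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    return text.split()

print(normalize("Thanks, for THE"))  # prints ['thanks', 'for', 'the']
print(normalize("thanks for the"))   # prints ['thanks', 'for', 'the']
```

With a step like this, both spellings of the input produce the same base words and therefore the same prediction.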