2024-12-14

An Overview

This presentation demonstrates a model for predicting the next word in a partial sentence or phrase. The model draws on word combinations organized to reveal their likelihood of occurrence. That likelihood is based on the number of times a given word combination occurs in a large sample of text, known in Natural Language Processing as a corpus.

Structure of the Model

  • A cleaned corpus from which phrase occurrences are counted.
  • 2-word and 3-word ngrams (bigrams and trigrams) computed from the corpus.
  • A frequency distribution of ngram occurrences, sorted in descending order.
  • An algorithm that matches an input phrase to the most likely ngram.
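
The four components above can be sketched in a few lines of Python. This is a minimal illustration, not the app's actual implementation: the function names (tokenize, build_ngrams, predict_next), the toy corpus, and the fallback from trigrams to bigrams when no 3-word match exists are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only word tokens (a stand-in for corpus cleaning)."""
    return re.findall(r"[a-z']+", text.lower())

def build_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def predict_next(phrase, trigrams, bigrams):
    """Return the most frequent continuation of the phrase, or None if nothing matches."""
    words = tokenize(phrase)
    # Try the longer context first: the last two words against the trigram counts.
    if len(words) >= 2:
        context = tuple(words[-2:])
        matches = {ng: c for ng, c in trigrams.items() if ng[:2] == context}
        if matches:
            return max(matches, key=matches.get)[-1]
    # Assumed fallback: the last word alone against the bigram counts.
    if words:
        context = (words[-1],)
        matches = {ng: c for ng, c in bigrams.items() if ng[:1] == context}
        if matches:
            return max(matches, key=matches.get)[-1]
    return None

# Toy corpus for illustration only; a real corpus would be far larger.
corpus = "let us go home . let us go now . let us know soon"
tokens = tokenize(corpus)
bigrams = build_ngrams(tokens, 2)
trigrams = build_ngrams(tokens, 3)

print(predict_next("Please let us", trigrams, bigrams))
```

In this toy corpus "let us go" occurs twice and "let us know" once, so the phrase "Please let us" is completed with "go".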

Performance Considerations

#    Word1    Word2  Word3  Freq         Prob
#1:    let       us     go  1173 6.884360e-05
#2:    let       us   know   941 5.522748e-05
#3:    let       us    get   636 3.732697e-05
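  • Ngrams are sorted in descending order of probability.
  • Row #1 (here, "go" following "let us") becomes the best prediction.
  • A trade-off exists between prediction accuracy and system resources: larger ngram tables cover more phrases but take more memory and longer to load.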
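
The ranking and the resource trade-off can be illustrated with a short Python sketch. It assumes that Prob is each ngram's Freq divided by the total number of ngram occurrences, which is consistent with the rows shown above (each Freq divided by its Prob gives a total of roughly 17 million). The counts below are toy values, and the helper names and the min_freq pruning threshold are hypothetical, shown only as one possible way to trade accuracy for memory.

```python
from collections import Counter

def to_ranked_table(counts):
    """Turn raw counts into (ngram, freq, prob) rows sorted by descending probability."""
    total = sum(counts.values())
    rows = [(ngram, freq, freq / total) for ngram, freq in counts.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows

def prune(counts, min_freq):
    """Drop rare ngrams to shrink the table (hypothetical pruning rule)."""
    return Counter({ng: c for ng, c in counts.items() if c >= min_freq})

# Toy counts for illustration only; these are not the corpus frequencies.
counts = Counter({
    ("let", "us", "go"): 5,
    ("let", "us", "know"): 3,
    ("let", "us", "get"): 1,
})

table = to_ranked_table(counts)
best = table[0]          # the first row is the best prediction
pruned = prune(counts, min_freq=2)

print(best[0][-1], len(counts), len(pruned))
```

Pruning removes the rarest ngrams, so the table needs less memory and loads faster, at the cost of losing predictions for uncommon phrases.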

A Shiny App

A Shiny app has been created to demonstrate the language prediction model. The user enters an incomplete sentence or phrase and clicks the Submit button; the app then repeats the phrase with the predicted word appended, set off by asterisks. Note that the first prediction may take some time while the large ngram distributions load.

A Shiny Predictor

Enjoy!