2024-01-26

Predicting the next word

Most devices include a next-word predictor that helps us write text messages, run internet searches, or draft documents in a word processor. The goal of this project is to develop a Shiny app that predicts a user's next word using n-gram prediction models. From Wikipedia, the free encyclopedia: “An n-gram is a sequence of n adjacent symbols in particular order. The symbols may be n adjacent letters (including punctuation marks and blanks).”
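
As a concrete illustration of that definition, here is how word-level n-grams can be extracted from a sentence (a minimal sketch in base R; the function name ngrams is just for this example):

    # Split a sentence into lowercase tokens, then slide a window of n words.
    ngrams <- function(text, n) {
      tokens <- strsplit(tolower(text), "\\s+")[[1]]
      if (length(tokens) < n) return(character(0))
      sapply(seq_len(length(tokens) - n + 1),
             function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
    }

    ngrams("to be or not to be", 3)
    #> [1] "to be or"  "be or not" "or not to" "not to be"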

Link to the Shiny app - try it out: http://ljmccormick09.shinyapps.io/Next_Word_Predictor/

A look at the most frequent n-grams from our training dataset

[Figures: most frequent n-grams in the training dataset]

Algorithm

The n-gram model approximates the language model with a Markov assumption: the probability of a word depends only on the n-1 words that immediately precede it.
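
In the usual notation (following the Jurafsky and Martin chapter cited in the references), the Markov assumption can be written as:

    P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})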

Maximum Likelihood Estimation (MLE) is used to estimate the probability of a word, regarded as the last component of an n-gram: the number of occurrences of the full n-gram is divided by the number of occurrences of its (n-1)-gram prefix.
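
Written out in the same notation, with C(·) denoting a count in the training data:

    P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1} \ldots w_{i-1} w_i)}{C(w_{i-n+1} \ldots w_{i-1})}

For example (with made-up counts), if the trigram “thanks for the” occurred 400 times and the bigram “thanks for” occurred 1,000 times, the estimated probability of “the” after “thanks for” would be 400/1000 = 0.4.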

To predict a word, the n-1 most recent words of the input are used as the prefix, and the model extracts the three next-word ids with the highest probabilities.
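
A minimal sketch of that lookup, assuming the model has been flattened into a table with a prefix column (the n-1 preceding words), a candidate next word, and its MLE probability; the table contents and the name top_next are illustrative, not the app's actual code:

    # Hypothetical trigram table: prefix = the two preceding words.
    model <- data.frame(
      prefix = c("thanks for", "thanks for", "thanks for"),
      word   = c("the", "your", "all"),
      prob   = c(0.40, 0.23, 0.07)
    )

    # Return the k candidate next words with the highest probabilities.
    top_next <- function(model, prefix, k = 3) {
      hits <- model[model$prefix == prefix, ]
      hits <- hits[order(-hits$prob), ]
      head(hits$word, k)
    }

    top_next(model, "thanks for")
    #> [1] "the"  "your" "all"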

If no match for the n-1 word prefix is found, the model applies the back-off principle: it checks the probabilities at the next lower n-gram level and extracts the next-word ids with the highest probabilities there.
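
A sketch of that fall-through logic, reusing top_next() from above and assuming models is a list of such tables indexed by n-gram order (models[[3]] for trigrams, and so on); it illustrates the back-off principle rather than the app's exact implementation:

    # Try the longest usable prefix first, then back off to shorter ones.
    predict_word <- function(models, input, k = 3) {
      words <- strsplit(tolower(input), "\\s+")[[1]]
      max_n <- min(length(models), length(words) + 1)
      if (max_n < 2) return(character(0))
      for (n in seq(max_n, 2)) {
        prefix <- paste(tail(words, n - 1), collapse = " ")
        cand <- top_next(models[[n]], prefix, k)
        if (length(cand) > 0) return(cand)  # matched at this order; stop
      }
      character(0)  # no prefix matched at any order
    }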

Example from the Shiny app: type a phrase and the top predicted next words are displayed.
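
For completeness, a minimal sketch of how such a Shiny front end could be wired up; the widget ids and layout here are illustrative, not the deployed app's source:

    library(shiny)

    # Assumes `models` and predict_word() from the sketches above are loaded.
    ui <- fluidPage(
      titlePanel("Next Word Predictor"),
      textInput("phrase", "Type a phrase:"),
      textOutput("prediction")
    )

    server <- function(input, output) {
      output$prediction <- renderText({
        req(input$phrase)
        paste(predict_word(models, input$phrase), collapse = " | ")
      })
    }

    shinyApp(ui = ui, server = server)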

References

Training dataset: Capstone Dataset

  1. Cardie, Claire. 2014. “Smoothing, Interpolation and Backoff.”
  2. Jurafsky, Dan, and James H. Martin. 2020. “N-Gram Language Models.” In Speech and Language Processing, 3rd ed.
  3. The wordpredictor package in R: https://github.com/pakjiddat/word-predictor