2022-07-28

Objective

The objective of this Capstone project is to build a next-word prediction algorithm and an interface that others can access: the user types a phrase (multiple words) into a text box, and the app outputs a prediction of the next word.

  • Steps taken
  1. Create a corpus of texts from Twitter, news and blogs, and clean it.
  2. Tokenize the cleaned corpus into words and convert the tokens into N-grams (sequences of words).
  3. Generate lists of bigrams, trigrams, quadgrams and quintgrams, and calculate the relative frequency (Score) of each according to the Stupid Backoff model.
  4. Develop the prediction algorithm to be used in the Shiny App.
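
Steps 2 and 3 can be sketched roughly as follows. This is a minimal Python illustration, not the app's own code (the app is built with Shiny); the function names `tokenize` and `count_ngrams`, the regular expression and the two-sentence toy corpus are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy stand-in for the cleaned Twitter, news and blog texts (hypothetical data).
corpus = ["I love New York", "I love New Jersey"]

# One frequency table per N-gram order, from unigrams up to quintgrams.
counts = {n: Counter() for n in range(1, 6)}
for line in corpus:
    tokens = tokenize(line)
    for n in counts:
        counts[n].update(count_ngrams(tokens, n))
```

Counting per line keeps N-grams from spanning two unrelated texts; with only four words per line, the quintgram table stays empty in this toy example.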

How it Works (1)

  • The Algorithm

The prediction is based on the Stupid Backoff model, which requires fewer resources than Katz’ Backoff Model and, according to Brants et al. (reference 3), approaches the accuracy of Kneser-Ney Smoothing when trained on large corpora.

To reduce complexity, the backoff factor λ is heuristically set to a fixed value (0.4) rather than computed, and it is applied at run time.
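
For reference, the score can be written recursively as below, following the definition in Brants et al. (reference 3), where the factor is called α rather than λ. Here f(·) is an N-gram frequency count, N is the number of tokens in the corpus, and w_{i-k+1}^{i-1} denotes the k−1 words preceding w_i. The unigram base case is included for completeness, although the steps above list only bigrams to quintgrams.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
  \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
  \lambda \, S(w_i \mid w_{i-k+2}^{i-1})       & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}, \quad \lambda = 0.4
```

Because the scores are not normalised into probabilities, no discounting step is needed, which is what keeps the model cheap to compute.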

How it Works (2)

  • Optimization

The N-gram scores depend only on the content of the corpus used to build the 5-gram language model. The user’s input only determines which branching paths (backoff) are taken at run time. All the scores based on the N-gram frequency counts are therefore pre-calculated to speed up the prediction.
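
A minimal Python sketch of this pre-calculation, assuming the counts are already available: each N-gram's score is its count divided by the count of its prefix. The function name `build_score_table` and the toy counts are hypothetical, and the app's actual table layout may differ.

```python
from collections import Counter, defaultdict

def build_score_table(ngram_counts, prefix_counts):
    """Map each prefix to its candidate next words and their relative-frequency scores."""
    table = defaultdict(dict)
    for ngram, count in ngram_counts.items():
        prefix, word = ngram[:-1], ngram[-1]
        # Score = count of the full N-gram / count of its (N-1)-gram prefix.
        table[prefix][word] = count / prefix_counts[prefix]
    return dict(table)

# Toy counts (hypothetical): "new" occurs 4 times, followed 3 times by "york".
unigram_counts = Counter({("new",): 4, ("york",): 3, ("jersey",): 1})
bigram_counts = Counter({("new", "york"): 3, ("new", "jersey"): 1})

bigram_scores = build_score_table(bigram_counts, unigram_counts)
```

Since none of this depends on what the user types, the tables can be built once offline and simply loaded by the app.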

The App looks up the N-grams stored in fast-performing data tables to retrieve the scores, applying the λ factor each time the search backs off to a lower-order N-gram. All the matches are sorted by factored score, highest first, and presented as predictions for the next word.
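
The lookup described above might be sketched as follows. This is a Python illustration under stated assumptions, not the app's implementation: plain dictionaries stand in for the data tables, the names `predict`, `tables` and `top_k` are hypothetical, and keeping only the highest-order score for each word is one reasonable reading of the model.

```python
LAMBDA = 0.4  # fixed backoff factor from the text

def predict(phrase, tables, max_order=5, top_k=3):
    """Return up to top_k (word, factored score) pairs, best first."""
    words = phrase.lower().split()
    candidates = {}
    factor = 1.0
    for n in range(max_order, 1, -1):  # quintgrams down to bigrams
        context_len = n - 1
        if len(words) < context_len:
            continue  # input too short for this order: skip without a penalty
        prefix = tuple(words[-context_len:])
        for word, score in tables.get(n, {}).get(prefix, {}).items():
            # Keep the score from the highest order in which the word was found.
            candidates.setdefault(word, factor * score)
        factor *= LAMBDA  # each further backoff step is discounted again
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy pre-calculated tables (hypothetical): order -> prefix -> {next word: score}.
tables = {
    3: {("i", "love"): {"new": 1.0}},
    2: {("love",): {"new": 0.5, "you": 0.5}},
}
predictions = predict("I love", tables)
```

In this toy case "new" keeps its trigram score, while "you" is found only after one backoff step and is scored 0.4 × 0.5.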

References:

  1. App: https://1-2-3.shinyapps.io/Next_Word_Predictor/

  2. “Speech and Language Processing”, by D. Jurafsky et al., Chapter 4, Draft of January 9, 2015 @ https://web.stanford.edu/~jurafsky/slp3/

  3. “Large Language Models in Machine Translation”, by T. Brants et al., in EMNLP/CoNLL 2007 @ http://www.aclweb.org/anthology/D07-1090.pdf