Capstone Type Predictor

Pablo Rojo
October 22nd, 2021

Shiny Application

[Image: predictor]

The Shiny application is available at https://pajarom.shinyapps.io/DataProduct/.

  • Type or paste your n-gram
  • The best 3 predictions for the last word can be selected at the bottom
  • Press space to predict the next word

Prediction Algorithm

This application is based on an n-gram stupid backoff algorithm, combined with a language model that attempts to identify words that are commonly used together.

  • n-gram models are Markov models that predict a word from a fixed window of previous words. n-gram probabilities can be estimated by counting in a corpus and normalizing (the maximum likelihood estimate). In this case, we used trigrams with stupid backoff (a backoff scheme rather than an interpolation).
  • Language models offer a way to assign a probability to a sentence or other sequence of words, and to predict a word from preceding words.
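
To make the counting-and-normalizing step concrete, here is a minimal sketch of a maximum likelihood trigram estimate. The toy corpus and the function names are illustrative assumptions, not taken from the application.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mle_trigram_prob(tokens, w1, w2, w3):
    """Maximum likelihood estimate: count(w1 w2 w3) / count(w1 w2)."""
    trigram_counts = Counter(ngrams(tokens, 3))
    bigram_counts = Counter(ngrams(tokens, 2))
    context = bigram_counts[(w1, w2)]
    if context == 0:
        return 0.0  # unseen context: the estimate is undefined, reported here as 0
    return trigram_counts[(w1, w2, w3)] / context

# Toy corpus (assumption, for illustration only)
toy_corpus = "i like green eggs and i like green ham".split()
print(mle_trigram_prob(toy_corpus, "i", "like", "green"))     # 2 / 2
print(mle_trigram_prob(toy_corpus, "like", "green", "eggs"))  # 1 / 2
```

An unseen trigram gets an estimate of zero, which is the gap a backoff scheme is meant to fill.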

Syntax Prediction

Syntactic correctness is primarily provided by the stupid backoff algorithm, since it takes into account the order of the words in the n-grams. The formula used to estimate the probability of each word is:

[Formula image: StupidBackoff]
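
As an illustration of the scoring scheme, the sketch below implements stupid backoff as it is commonly described: use the relative frequency of the longest matching n-gram, and multiply by a fixed factor each time the context is shortened. The factor 0.4, the toy corpus, and all function names are assumptions for this example and are not confirmed to match the application's model.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the value used by the application is not stated

def build_counts(tokens, max_n=3):
    """Count every n-gram of length 1..max_n, keyed by tuple."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(word, context, counts, total_tokens, alpha=ALPHA):
    """Score `word` given a tuple of preceding words (not a normalized probability)."""
    penalty = 1.0
    context = tuple(context)
    while context:
        seen = counts[context + (word,)]
        if seen > 0:
            return penalty * seen / counts[context]
        context = context[1:]  # drop the oldest word and back off
        penalty *= alpha
    return penalty * counts[(word,)] / total_tokens  # unigram fallback

def predict(context, counts, total_tokens, k=3):
    """Return the k highest-scoring next words for the last two words of context."""
    vocab = [gram[0] for gram in counts if len(gram) == 1]
    context = tuple(context)[-2:]
    ranked = sorted(
        vocab,
        key=lambda w: stupid_backoff(w, context, counts, total_tokens),
        reverse=True,
    )
    return ranked[:k]

# Toy corpus (assumption, for illustration only)
tokens = "i like green eggs and i like green ham".split()
counts = build_counts(tokens)
print(predict(("like", "green"), counts, len(tokens)))
```

Scoring the whole vocabulary for every query keeps the sketch short; a real model would look up only the continuations observed for each context.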

The resulting model, built from only 1% of the available data, is small (~10 MB), but building it is CPU-intensive because of:

  • tokenization into words, bigrams and trigrams
  • symbol removal
  • spell checking
  • profanity removal
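
The preparation steps above can be sketched roughly as follows. This is a simplified illustration: the profanity list is a hypothetical placeholder, spell checking is omitted, and the application's actual pipeline may differ.

```python
import re

# Hypothetical placeholder list; a real profanity filter would load a full word list
PROFANITY = {"darn", "heck"}

def clean_tokens(text, banned=PROFANITY):
    """Lowercase, strip symbols and digits, split into words, drop banned words."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)  # keep only letters, apostrophes, spaces
    return [t for t in text.split() if t not in banned]

def ngrams(tokens, n):
    """Join each run of n consecutive tokens into a single string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = clean_tokens("Well, darn: it's 5 o'clock already!")
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(words)
print(bigrams)
print(trigrams)
```

Note that dropping a word joins its neighbours into n-grams that never occurred in the text; splitting the sentence at the removed word would avoid this at the cost of more bookkeeping.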

Semantics Prediction

To provide semantic context to the predictions, a language model was built that identifies words that commonly appear together, regardless of their order. The main limitation we faced was that the model size grew geometrically, and we needed several techniques to reduce it from 1 GB to less than 100 MB:

  • stop word removal (stop words carry no semantic relevance)
  • stemming to group several words into a single stem (this added some delay during prediction)
  • removal of low-relevance relations (only outlier frequencies were kept)
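
A rough sketch of these three reductions is shown below. The stop word list, the crude suffix-stripping stemmer, and the reading of "outlier" as "more than z standard deviations above the mean pair count" are all assumptions for illustration; the application's actual choices are not stated here.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical short stop word list, for illustration only
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def cooccurrence_counts(sentences):
    """Count unordered pairs of stems that appear in the same sentence."""
    pairs = Counter()
    for sentence in sentences:
        stems = {crude_stem(w) for w in sentence.lower().split()
                 if w not in STOP_WORDS}
        # Sorting makes (a, b) and (b, a) the same key, so order is ignored
        for a, b in combinations(sorted(stems), 2):
            pairs[(a, b)] += 1
    return pairs

def keep_outliers(pairs, z=1.0):
    """Keep only pairs whose count exceeds mean + z * standard deviation (assumed rule)."""
    values = list(pairs.values())
    if not values:
        return {}
    cutoff = mean(values) + z * pstdev(values)
    return {pair: count for pair, count in pairs.items() if count > cutoff}

sentences = [
    "the dog chased the cat",
    "dogs chasing cats",
    "a cat sleeps in the sun",
    "the dog barks",
]
pairs = cooccurrence_counts(sentences)
kept = keep_outliers(pairs)
print(len(pairs), "pairs before pruning,", len(kept), "after")
```

Because the stored keys are stems, the words typed by the user have to be stemmed at prediction time as well, which is consistent with the extra delay noted above.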

Next steps

This algorithm is far from complete or optimal. It is just a proof of concept in the area of Natural Language Processing.

During the testing performed for Quizzes 2 and 3, several limitations were identified:

  1. Names are not considered in the models.
  2. Sentiment could also help to prioritize certain forms over others.
  3. The combination of syntax and semantics could be optimized further.

Send us your feedback!