Next Word Prediction

Mauricio Camargo

Capstone Project

Coursera - Data Science Specialization
by Johns Hopkins University

Next Word Prediction

This product predicts the next word (token) given the beginning of a sentence. It uses Natural Language Processing techniques to build a probabilistic model trained on real-world text.

  • It can be used to give users a faster typing experience;
  • Trained on data from Twitter, Blogs and Newspapers;
  • Easily trained on additional languages.

Easy Interface

How it works

  • During training, the original text is broken into unigrams, bigrams and trigrams.
  • For each combination of words seen in the training text, the model keeps track of how many times each next word occurs.
  • The model returns the most probable next word based on trigrams (to capture a richer context).
  • If the current trigram was never seen, it falls back on bigrams, and finally on unigrams.
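
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the app: the function names `train` and `predict_next` are hypothetical, tokenization is a plain lowercase whitespace split, and the toy corpus is made up for the example.

```python
from collections import Counter, defaultdict


def train(sentences):
    """Count how often each word follows a context of 0, 1 or 2 words."""
    counts = defaultdict(Counter)  # context tuple -> Counter of next words
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: naive tokenization
        for i, word in enumerate(tokens):
            # n = 0 gives unigram counts, n = 1 bigrams, n = 2 trigrams
            for n in range(3):
                if i >= n:
                    context = tuple(tokens[i - n:i])
                    counts[context][word] += 1
    return counts


def predict_next(counts, text):
    """Return the most frequent next word, backing off to shorter contexts."""
    tokens = text.lower().split()
    for n in (2, 1, 0):  # trigram context first, then bigram, then unigram
        if len(tokens) < n:
            continue
        context = tuple(tokens[len(tokens) - n:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None  # the model was trained on no text at all


# Toy corpus, invented for this example
corpus = [
    "i love new york",
    "i love new music",
    "i love new york city",
    "we love pizza",
]
counts = train(corpus)
print(predict_next(counts, "i love new"))  # york (trigram context was seen)
print(predict_next(counts, "they love"))   # new (backs off to the bigram context)
```

The backoff here simply takes the first context length that was seen in training; it does not weight or smooth the counts across context lengths.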

How the model is evaluated

Not all of the available text is used for training; part of it is held out for evaluation.

  • During evaluation, we compare different models by computing the probability each one assigns to excerpts it has never seen before.
  • The model that assigns the highest probability to this new data is selected as the best one.
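
A minimal sketch of this comparison, assuming two candidate models (a unigram and a bigram model) and add-one smoothing so that unseen words still receive a non-zero probability. The smoothing method, the function names `train_counts` and `avg_log_prob`, and the toy sentences are assumptions made for illustration; the slides do not specify them. Probabilities are compared on a log scale, averaged per word.

```python
import math
from collections import Counter, defaultdict


def train_counts(sentences, order):
    """Count next words after each context of (order - 1) preceding tokens."""
    counts = defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        # Pad the start so the first word also has a full context
        tokens = ["<s>"] * (order - 1) + sentence.lower().split()
        for i in range(order - 1, len(tokens)):
            context = tuple(tokens[i - order + 1:i])
            counts[context][tokens[i]] += 1
            vocab.add(tokens[i])
    return counts, vocab


def avg_log_prob(model, sentences, order):
    """Average log-probability per word, using add-one smoothing (assumption)."""
    counts, vocab = model
    v = len(vocab) + 1  # one extra slot reserved for unknown words
    total, n_words = 0.0, 0
    for sentence in sentences:
        tokens = ["<s>"] * (order - 1) + sentence.lower().split()
        for i in range(order - 1, len(tokens)):
            context = tuple(tokens[i - order + 1:i])
            seen = counts.get(context, Counter())
            p = (seen[tokens[i]] + 1) / (sum(seen.values()) + v)
            total += math.log(p)
            n_words += 1
    return total / n_words


# Toy data, invented for this example
train_text = ["the cat sat on the mat", "the dog sat on the rug"]
held_out = ["the cat sat on the rug"]  # never shown during training

scores = {}
for order in (1, 2):  # 1 = unigram model, 2 = bigram model
    model = train_counts(train_text, order)
    scores[order] = avg_log_prob(model, held_out, order)

best_order = max(scores, key=scores.get)  # higher (less negative) is better
print(scores, best_order)
```

Averaging per word keeps the scores comparable across held-out excerpts of different lengths.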

Try it out at: https://mauriciogmc.shinyapps.io/Testing/

(Due to restrictions on the Shiny Apps server, the model used for the demonstration is a heavily simplified one.)